

Dynamics of Facial Expression: Recognition of Facial Actions and Their
Temporal Segments From Face Profile Image Sequences
Maja Pantic, Member, IEEE, and Ioannis Patras, Member, IEEE

Abstract—Automatic analysis of human facial expression is a challenging problem with many applications. Most of the existing automated systems for facial expression analysis attempt to recognize a few prototypic emotional expressions, such as anger and happiness. Instead of representing another approach to machine analysis of prototypic facial expressions of emotion, the method presented in this paper attempts to handle a large range of human facial behavior by recognizing facial muscle actions that produce expressions. Virtually all of the existing vision systems for facial muscle action detection deal only with frontal-view face images and cannot handle temporal dynamics of facial actions. In this paper, we present a system for automatic recognition of facial action units (AUs) and their temporal models from long, profile-view face image sequences. We exploit particle filtering to track 15 facial points in an input face-profile sequence, and we introduce facial-action-dynamics recognition from continuous video input using temporal rules. The algorithm performs both automatic segmentation of an input video into facial expressions pictured and recognition of temporal segments (i.e., onset, apex, offset) of 27 AUs occurring alone or in a combination in the input face-profile video. A recognition rate of 87% is achieved.

Index Terms—Computer vision, facial action units, facial expression analysis, facial expression dynamics analysis, particle filtering, rule-based reasoning, spatial reasoning, temporal reasoning.

Manuscript received December 15, 2004; revised June 10, 2004. This work was supported by the Netherlands Organization for Scientific Research (NWO) under Grant EW-639.021.202. This paper was recommended by Associate Editor S. Sarkar.
M. Pantic is with Delft University of Technology, Electrical Engineering, Mathematics and Computer Science, The Netherlands (e-mail: [email protected]).
I. Patras is with the Department of Computer Science, The University of York, York YO10 5DD, U.K. (e-mail: [email protected]).
Digital Object Identifier 10.1109/TSMCB.2005.859075
1083-4419/$20.00 © 2006 IEEE

I. INTRODUCTION

THE human face is involved in a large variety of different activities. It houses the apparatus for speech production as well as the majority of our sensors (eyes, nose, mouth). Besides these biological functions, the human face provides a number of social signals essential for our public life. The face mediates person identification, attractiveness, and facial communicative cues, that is, facial expressions. Our utterances are accompanied by the appropriate facial expressions, which clarify what is said and whether it is supposed to be important, funny or serious. Facial expressions reveal our current focus of attention, synchronize the dialogue, signal comprehension or disagreement; in brief, they regulate our interactions with the environment and other persons in our vicinity [1]. As indicated by Mehrabian [2], whether the listener feels liked or disliked depends only for 7% on the spoken word, for 38% on vocal utterances, while facial expressions determine 55% of this feeling. Finally, facial expressions are our direct and naturally preeminent means of communicating emotions [1], [3]. Hence, facial expressions play a very important role in human face-to-face interpersonal interaction. Automatic analysis of facial expressions would, therefore, be highly beneficial for fields as diverse as behavioral science, psychology, medicine, security, education, and computer science (facilitating lip reading, face and visual speech synthesis, videoconferencing, affective computing, and anticipatory human-machine interfaces). It is this wide range of applications that has produced a surge of interest in machine analysis of facial expressions.

Most of the facial expression analyzers developed so far attempt to recognize a small set of prototypic emotional facial expressions, i.e., fear, sadness, disgust, anger, surprise, and happiness (e.g., [4]–[8]; for an exhaustive survey, see [9]). This practice may follow from the large body of psychological research (from Darwin [10] to Ekman [3], [11]) which argues that these "basic" emotions have corresponding prototypic facial displays. However, there is also a growing body of psychological research that argues that it is not prototypic expressions but some components of those expressions (e.g., "squared" mouth, raised eyebrows, etc.) which are commonly displayed and universally linked with the emotion labels listed above [1], [12]. To detect such subtle facial expressions and to make the facial expression information available for usage in the various applications mentioned above, automatic recognition of facial muscle actions (i.e., atomic facial signals) is needed.

A. Facial Action Coding System (FACS)

There are several methods for measuring and describing facial muscular activity [13]. From these, the FACS is the most widely used method in psychological research [13]. Ekman and Friesen developed the original FACS in the 1970s by determining how the contraction of each facial muscle (singly and in combination with other muscles) changes the appearance of the face. They examined videotapes of facial behavior to identify specific changes that occur with muscular contractions and how to differentiate one from another. They associated the facial appearance changes with the action of muscles that produce them. Namely, the changes in the facial expression are described with FACS in terms of 44 different action units (AUs), each of which is anatomically related to the contraction of either a specific facial muscle or of a set of facial muscles. Along with the definition of various AUs, FACS also provides the rules for visual detection of AUs and their temporal segments (onset, apex, offset) in a video of the observed face.

Using these rules, a FACS coder (i.e., a human observer having a formal training in using FACS) "dissects" a shown facial expression, decomposing it into the specific AUs and their temporal segments that produced the expression. The FACS Manual was first published in 1978 [14]. The latest version was published in 2002 [15].

B. Automated FACS: Frontal Face

Although FACS provides a good foundation for AU coding of face images by human observers, automatic recognition of AUs by computers remains difficult. One problem is that AUs can occur in more than 7000 different combinations [13], causing bulges (e.g., by the tongue pushed under one of the lips) and various in- and out-of-plane movements of facial components (e.g., jetted jaw) that are difficult to detect in 2D face images. Few methods have been reported for automatic AU detection in face image sequences [16]. Some researchers described patterns of facial motion that correspond to a few specific AUs but did not report on actual recognition of these AUs (e.g., [4]–[6], [8], [17], [18]). Only recently there has been an emergence of efforts toward automatic analysis of facial expressions into elementary AUs [19]. For instance, the Machine Perception group at UCSD has proposed several methods for automatic AU coding of facial expressions. To detect 6 individual AUs in face image sequences free of head motions, Bartlett et al. [20] used a 61 × 10 feed-forward neural network. They achieved 91% accuracy by feeding the pertinent network with the results of a hybrid system combining holistic spatial analysis and optical flow with local feature analysis. To recognize eight individual AUs and four combinations of AUs in face image sequences free of head motions, Donato et al. [21] used Gabor wavelet representation and independent component analysis. They reported a 95.5% average recognition rate achieved by their method. The most recent work by Bartlett et al. [22] reports on accurate automatic recognition of 18 AUs (95% average recognition rate) from near frontal-view face image sequences using Gabor filters and Support Vector Machines. Another group that has focused on automatic FACS coding of face image sequences is that led by Cohn and Kanade. To recognize eight individual AUs and seven combinations of AUs in face image sequences free of head motions, Cohn et al. [23] used facial feature point tracking and discriminant function analysis and achieved an 85% average recognition rate. Tian et al. [24] used lip tracking, template matching and neural networks to recognize 16 AUs occurring alone or in combination in near frontal-view face image sequences. They reported an 87.9% average recognition rate.

The authors' group also reported on multiple efforts toward automatic analysis of facial expressions into atomic facial actions. The majority of this previous work concerns automatic AU recognition in static face images [7], [25]. Only recently, the authors' group has focused on automatic FACS coding of face video. To recognize 15 AUs occurring alone or in combination in near frontal-view face image sequences, Valstar et al. [26] used temporal templates (i.e., motion history images) and a combined k-Nearest-Neighbor and rule-based classifier. An average recognition rate of 65% was reported.

C. Automated FACS: Profile Face

In contrast to these previous approaches to automatic AU detection, which deal only with frontal-view face images and cannot code temporal segments (i.e., onset, apex, offset) of AUs [19], the research reported here addresses the problem of automatic detection of AUs and their temporal segments from profile-view face image sequences. It was undertaken with the following motivations.

1) In a frontal-view face image, facial actions such as tongue pushed under the upper lip (AU36t) or pushing the jaw forward (AU29) represent out-of-plane nonrigid movements that are difficult to detect. Such facial actions are clearly observable in a profile view. Hence, the use of the face-profile view promises a qualitative enhancement of the AU detection performed (by enabling detection of AUs that are difficult to encode in a frontal view).

2) Existing AU detectors achieve good recognition rates, but virtually all of them perform well only when the user faces the camera and does not change his/her three-dimensional (3-D) head pose. Robust AU detection, independent of rigid head movements that can cause changes in the viewing angle and the visibility of the tracked face and its features, is yet to be attained. Perhaps the most promising method for achieving this aim is through the use of multiple cameras yielding multiple views of the face [27]. For example, the system could be trained using triplets of images per facial expression to be recognized, shown at three orientations that differ by a rotation of 90° (portrait, left and right profile). Novel rotations at 30° and 45° from the nearest trained orientation can be interpolated between the trained orientations. Test images of facial displays shown at any orientation between the left and the right profile view of the face could be finally classified by generalizing from independent facial expression representations at each training/interpolated facial view. A basic understanding of how to achieve automatic AU detection from the profile view of the face is necessary if such a technological framework for automatic AU detection from multiple views of the face is to be established.

3) There is now a growing body of psychological research that argues that temporal dynamics of facial behavior (i.e., the timing and the duration of facial activity) is a critical factor for the interpretation of the observed behavior [1]. For example, Schmidt and Cohn [28] have shown that spontaneous smiles, in contrast to posed smiles, are fast in onset, can have multiple AU12 apexes (i.e., multiple rises of the mouth corners), and are accompanied by other AUs that appear either simultaneously with AU12 or follow AU12 within 1 s. Since it takes more than one hour to manually score 100 still images or a minute of videotape in terms of AUs and their temporal segments [14], it is obvious that automated tools for the detection of AUs and their temporal dynamics would be highly beneficial. Nevertheless, no effort in automating the detection of the temporal segments of AUs in face image sequences has been reported so far.

4) Areas where machine tools for the analysis of human facial expressions from face profile could expand and enhance research include numerous specialized areas in scientific and professional sectors. Automatic analysis of expressions from the face-profile view would facilitate research on human emotion, which is in turn important for areas such as behavioral science, psychology, neurology (in studies on the dependence between impairments of emotional abilities and brain lesions), and psychiatry (in studies on autism and schizophrenia) [29].

It seems that negative emotions (where facial displays of AU2, AU4, AU9, etc., are often involved) are more easily perceivable from the left hemiface and the full face than from the right hemiface and that, in general, the left hemiface is perceived to display more emotion than the right hemiface [30]. Also, it seems that facial actions involved in spontaneous emotional expressions are more symmetrical, involving both the left and the right side of the face, than deliberate actions displayed on request [31]. Based upon these observations, Mitra and Liu [32] have shown that facial asymmetry has sufficient discriminating power to improve the performance of an automated emotion classifier significantly. Martinez [33] has shown that, by taking into account facial asymmetry caused by a certain emotion, expression-invariant face recognition can be achieved. Finally, machine analysis of facial behavior from profile expressions could be of considerable value in any situation where issues concerning emotion, attention, deception, and attitude are of importance and frontal-face observations are not always feasible. Such situations occur often in security sectors, where the observed persons should not be aware of the video surveillance.

Fig. 1. Outline of the profile-face-based method for detection of AUs and their temporal dynamics.

The authors have already built a first prototype of an automated profile-face-based AU detector [34], the novel version of which is presented in this paper. This prototype system was aimed at automatic recognition of 20 AUs from subtle changes in the contour of the face profile tracked in an input face-profile image sequence. This previous version of the profile-face-based AU detector had several limitations: 1) it was applicable only to images depicting the left profile of the face; 2) it did not apply temporal reasoning; 3) it could not recognize temporal dynamics of AUs; and 4) AU coding was based only upon changes in the contour of the face profile region (i.e., changes within the face profile region were disregarded).

The current version of the method, proposed in this paper, addresses these limitations. Fig. 1 outlines this novel method, a preliminary version of which was reported in [35]. It operates under two assumptions: 1) the input video sequence is a nonoccluded (left or right) near-profile view of the face with possible in-image-plane head rotations, and 2) the first frame of it shows a neutral expression. After the facial points are initialized in the first frame of the input image sequence, we exploit particle filtering to track the 15 points automatically in the rest of the sequence. Based on the changes in the position of the points, we measure changes in facial expression. Changes in the position of the facial points are first transformed into a set of mid-level parameters for AU recognition. Based upon the temporal consistency of these parameters, a rule-based method encodes temporal segments (onset, apex, offset) of 27 AUs occurring alone or in combination. The usage of temporal information allows us not only to code a video segment in terms of the corresponding AUs, but also to automatically segment an arbitrarily long video sequence into the segments that correspond to different expressions. Facial point tracking, parametric representation of the extracted information, recognition of AUs and their dynamics, and automatic segmentation of the video sequence are explained in Sections II, III, IV, and V. Evaluation studies and experimental results are discussed in Section VI.

II. FACIAL POINT TRACKING

Contractions of facial muscles induce movements of the facial skin and changes in the appearance of facial features (facial components) such as the eyebrows, nose, and mouth. Their shape and location, as visible in a face profile, can alter immensely with facial expressions (e.g., pursed lips versus jaw dropped). To be able to reason about the shown expression and the facial muscle actions that produced it, one must first detect the current appearance of the facial features. To do so, we track a set of facial points illustrated in Fig. 2, the locations of which alter as the current appearance of the facial features changes with the facial expression. In this paper, we do not address the problem of initially locating the facial points. We assume that they are initialized either manually or automatically in the first frame of the input face image sequence (e.g., using the method proposed in [25] and/or in [36]) and that they are automatically tracked for the rest of the sequence by applying a particle filtering method.

In recent years, particle filtering has been the dominant paradigm for tracking the state α of a temporal event given a set of noisy observations Y_t = {y_1, ..., y_t} up to the current time instant [37]–[43]. In our case, the state α is the location of a facial fiducial point, while the set Y_t is the set of image frames up to the current time instant. The main idea behind particle filtering is to maintain a set of solutions that are an efficient representation of the conditional probability p(α | Y_t). By maintaining a set of solutions instead of a single estimate (as is done by Kalman filtering, for example), particle filtering is able to track multimodal conditional probabilities p(α | Y_t), and it is therefore robust to missing and inaccurate data and particularly attractive for estimation and prediction in nonlinear, non-Gaussian systems. In this paper, we adapt the auxiliary particle filtering method that was introduced by Pitt and Shephard [39] to independently track the locations of the 15 facial features depicted in Fig. 2. In order to make the tracking robust to in-plane head rotations and translations as well as to small translations along the z-axis, we estimate a global affine transformation T for each frame and, based on it, we register the current frame to the first frame of the sequence. In order to estimate the global affine transformation, we track three referential points: the top of the forehead (P1), the tip of the nose (P4), and the ear canal entrance (P15). We use these points as the referential points because of their stability with respect to nonrigid facial movements: contractions of facial muscles do not cause physical displacement of these points [44]. We estimate the global affine transformation T as the one that minimizes the distance (in the least-squares sense) between the T-based projection of the tracked locations of the referential points and these locations in the first frame of the sequence. The rest of the facial features are tracked in image frames that have been compensated for the transformation T. In what follows, without loss of generality, we will describe the proposed color-based tracking scheme for tracking a single facial feature.

Fig. 2. Facial points (fiducial points of the face components).
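To make the registration step concrete, the following sketch estimates a 2-D affine transformation from the tracked positions of the three referential points in the least-squares sense and uses it to map points back to the coordinate frame of the first frame. It is a minimal illustration under the assumptions stated above, not the authors' implementation; the 2 × 3 parameterization and the example coordinates are chosen for illustration only.

```python
import numpy as np

def estimate_affine(src, dst):
    """Least-squares 2-D affine transform mapping src points onto dst points.

    src, dst: (N, 2) arrays of corresponding points (N >= 3), e.g. the tracked
    locations of P1, P4 and P15 in the current frame and in the first frame.
    Returns a 2x3 matrix A such that dst is approximately A @ [x, y, 1]^T.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    ones = np.ones((src.shape[0], 1))
    X = np.hstack([src, ones])                    # (N, 3) design matrix
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)   # (3, 2) least-squares solution
    return A.T                                    # (2, 3)

def apply_affine(A, points):
    """Register tracked points to the first frame using the estimated transform."""
    pts = np.asarray(points, dtype=float)
    ones = np.ones((pts.shape[0], 1))
    return (A @ np.hstack([pts, ones]).T).T

# Example: referential points in frame 1 and in the current frame (illustrative).
ref_first = np.array([[120.0, 40.0], [60.0, 150.0], [200.0, 160.0]])   # P1, P4, P15
ref_now = np.array([[125.0, 45.0], [66.0, 154.0], [205.0, 166.0]])
A = estimate_affine(ref_now, ref_first)
registered = apply_affine(A, ref_now)   # close to ref_first
```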
A. Auxiliary Particle Filtering

The tracking is initialized in the first frame of the input image sequence when a window is centered around the facial feature to be tracked. Let ω denote the template that contains the color information in such a window. We will use α to denote the unknown location of the facial feature at the current time instant, and Y_t = {y_1, ..., y_t} will denote the observations (i.e., the images) up to the current time instant. In order to fully specify a particle filter, we need to model two probability densities. One is the observation likelihood p(y | α), which expresses in our case how similar the color information in a window in image y around the position α is to the color template ω. The second density is the transition density p(α_t | α_{t−1}) which, in our case, models the temporal dynamics of the facial feature. That is, p(α_t | α_{t−1}) models the probability that the facial feature is at position α_t in the current frame, given that it was at position α_{t−1} in the previous frame.

The main idea of particle filtering is to maintain a particle-based representation of the a posteriori probability p(α | Y_t) of the state α given all the observations up to the current time instance. This means that the distribution p(α | Y_t) is represented by a set of pairs {(s_k, π_k)} such that, if s_k is chosen with probability equal to π_k, then it is as if s_k were drawn from p(α | Y_t). That is [40], the probability p(α | Y_t) is approximated by the discrete distribution Σ_k π_k δ(α − s_k), where δ is the Dirac function and Σ_k π_k = 1. Let the particles s_k be sampled from a sampling distribution G which has a positive probability density function g that (up to a normalization constant) is equal to a function f. Then calculate the weights as π_k ∝ f(s_k)/g(s_k), where Σ_k π_k = 1. It can be shown that, if the pairs (s_k, π_k) are chosen in this way then, as the number of particles approaches infinity, the estimate Σ_k π_k h(s_k) converges to the expected (under the distribution f) value of the function h. Therefore, once a particle-based representation of the a posteriori probability p(α | Y_t) is available, we can estimate statistics such as the mean and the variance of the state. In our case, an estimate of the position of the facial feature is obtained as the mean of the state α, that is

    α̂_t = Σ_k π_k^t s_k^t.    (1)
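As a two-line illustration of (1), the point estimate is just the weighted mean of the particle set (a sketch; the variable names are illustrative):

```python
import numpy as np

def state_estimate(particles, weights):
    """Mean of the particle-based posterior, cf. (1): alpha_hat = sum_k pi_k s_k."""
    particles = np.asarray(particles, dtype=float)   # (K, 2) candidate positions
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                # ensure the weights sum to 1
    return weights @ particles

print(state_estimate([[10.0, 20.0], [12.0, 22.0]], [0.25, 0.75]))   # [11.5 21.5]
```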

In the particle filtering framework, our representation of the


a posteriori probability of the state is updated in a
recursive way. More specifically, let us assume that at the current
time instant we have a particle-based representation of the a
posteriori probability of the state at the previous
time instant. That is, let us assume that we have a collection of
particles and their corresponding weights (i.e., )
that represent the a posteriori probability at the previous time
instant. Then, we can summarize a step of the Auxiliary Particle
Filtering that will result in a collection of particles and their
corresponding weights (i.e., ) that represent the a
posteriori probability at the current time instant as follows.
1) Propagate all particles via the transition probability
in order to arrive at a collection of particles
.
2) Evaluate the likelihood associated with each particle , Fig. 3. Outline of the auxiliary particle filtering method [39].
that is, let .
3) Draw particles from the probability density that is rep-
resented by the collection (see Fig. 4, lower around the position is to the color template . Note that we
left plot). Let be the index of the particle that was drawn need an observation model that given an image , a position
at the draw ( ), that is, let the particle and a color template can evaluate the scalar value .
be selected at the draw (in general ). This is the The transition density is used at steps 1 and 4. Its role is
essence of the auxiliary particle filtering; it favors parti- to propagate a particle from the previous frame to a position
cles with high (i.e., particles that end up in areas with in which is likely to be in the current frame. Note that we
high likelihood when propagated with the transition den- need a transition model from which we can sample, that is, a
sity). model that, given a particle , can produce a particle with a
4) Propagate each of the particles that were draw at step 3 probability equal to . In what follows we will formally
with the transition probability in order to arrive define the two density models that we use.
at a collection of particles .
5) Assign a weight to each particle according to (2) B. Robust Color-Based Observation Model
Various observation models have been proposed for template-
(2) based tracking, where special attention is given to both the ro-
bustness in the presence of clutter and occlusions and the adapta-
tion of the observation model [45], [46]. Recently, attention has
This results in a collection of particles and their corre- been drawn to color-based tracking [42], [47]. In what follows,
sponding weights (i.e., ). This representation we propose a color-based observation model that is invariant to
is an approximation of the density .1 global illumination changes.
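The five steps above can be sketched as follows, with the transition sampler and the observation likelihood passed in as generic callables (defined in the next two subsections). This is a schematic reconstruction of the auxiliary scheme of Pitt and Shephard, not the authors' Matlab code; the draw at step 3 here also folds in the previous weights.

```python
import numpy as np

def apf_step(particles, weights, sample_transition, likelihood, rng):
    """One auxiliary particle filtering step.

    particles: (K, 2) particle positions at time t-1; weights: (K,) normalized weights.
    sample_transition(p): draws a position at time t given a position p at time t-1.
    likelihood(p): evaluates p(y_t | p) for a candidate position p.
    """
    K = len(particles)
    # 1) Propagate every particle through the transition density.
    mu = np.array([sample_transition(p) for p in particles])
    # 2) Evaluate the likelihood of each propagated particle.
    lam = np.array([likelihood(m) for m in mu])
    # 3) Draw K indices, favouring particles that land in high-likelihood areas.
    probs = weights * lam
    probs /= probs.sum()
    idx = rng.choice(K, size=K, p=probs)
    # 4) Propagate the selected particles once more through the transition density.
    new_particles = np.array([sample_transition(particles[k]) for k in idx])
    # 5) Weight by the ratio of the new likelihood to the auxiliary likelihood, cf. (2).
    new_weights = np.array([likelihood(s) for s in new_particles]) / lam[idx]
    new_weights /= new_weights.sum()
    # Point estimate: the weighted mean of the state, cf. (1).
    estimate = (new_weights[:, None] * new_particles).sum(axis=0)
    return new_particles, new_weights, estimate
```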
Fig. 3. Outline of the auxiliary particle filtering method [39].

An outline of the auxiliary particle filtering algorithm, in which the steps of the algorithm are visually depicted, is given in Fig. 3. At each subfigure, a set of circles depicts a set of particles, where larger circles depict particles with higher weights. At each subfigure, the continuous line depicts the probability density function that is represented by the corresponding set of particles. In addition, in the top-right subfigure the observation likelihood p(y_t | α) is depicted with a dashed line. Note that the horizontal axes of the two plots at the left represent the state α_{t−1} (i.e., the state at the previous time instant), while the horizontal axes of the two plots at the right depict the state α_t at the current time instant.

We proceed by modeling the observation likelihood p(y | α) and the transition density p(α_t | α_{t−1}). The observation likelihood is used at steps 2 and 5, and its role is to assign higher weights to particles according to how similar the color information in a window around the position α is to the color template ω. Note that we need an observation model that, given an image y, a position α, and a color template ω, can evaluate the scalar value p(y | α). The transition density is used at steps 1 and 4. Its role is to propagate a particle from the previous frame to a position in which it is likely to be in the current frame. Note that we need a transition model from which we can sample, that is, a model that, given a particle s_k^{t−1}, can produce a particle μ_k^t with a probability equal to p(μ_k^t | s_k^{t−1}). In what follows we will formally define the two density models that we use.

¹To be more specific, with the above scheme, at step 4 we arrive at a collection of pairs (s_j^t, k_j) (i.e., a particle s_j^t and the index k_j of the particle that was drawn at the jth draw at step 3), each one of which is sampled from a probability G(α, k) with density proportional to g(α, k) = p(y_t | μ_k^t) p(α | s_k^{t−1}) π_k^{t−1}. With the weighting of step 5, the set {((s_j^t, k_j), w_j)} is a particle-based representation of p(α_t, k | Y_t), which up to proportionality is p(y_t | α_t) p(α_t | s_k^{t−1}) π_k^{t−1}. By dropping the index k from the above set {((s_j^t, k_j), w_j)}, we arrive at a particle-based representation of p(α_t | Y_t). See [39] for a complete proof.

B. Robust Color-Based Observation Model

Various observation models have been proposed for template-based tracking, where special attention is given to both the robustness in the presence of clutter and occlusions and the adaptation of the observation model [45], [46]. Recently, attention has been drawn to color-based tracking [42], [47]. In what follows, we propose a color-based observation model that is invariant to global illumination changes.

Our observation model is initialized in the first frame of the input image sequence when the user centers a window around the facial point to be tracked. Let ω denote the template feature vector that contains the RGB color information in such a window in the first frame, and let ω_i denote the color at a pixel i. Clearly, ω has a dimensionality equal to three times the number of pixels in the window. We need to define the probability density p(y | α). Let c(α) denote the data vector that contains the RGB color information at the image window around position α, and let c_i(α) denote the color at a pixel i. We propose a color-based distance between the vectors c(α) and ω that is invariant to global changes in the intensity. For each pixel i, the color distance e_i is defined as

    e_i = || c_i(α)/c̄(α) − ω_i/ω̄ ||    (3)

where c̄(α) is the mean of the vector c(α) (i.e., its average intensity) and ω̄ is the mean of the vector ω (i.e., the average intensity of the color template). By dividing the color at each pixel with the average intensity of the color template to which it belongs, the color difference vector becomes invariant to changes in the illumination intensity.

Finally, we define the scalar color distance using a robust function ρ [48]. More specifically

    d(c(α), ω) = Σ_i ρ(e_i)    (4)

where ||·|| in (3) is the L2 norm and the robust function ρ is the absolute value in our experiments. Then, the observation likelihood is

    p(y | α) = (1/Z) exp(−σ d(c(α), ω))    (5)

where Z is a normalization term that can be ignored, since in the context of particle filtering the weights of the particles are renormalized at each iteration [see (2)]. The term σ is a scaling parameter which was set to 0.01 in all of our experiments.
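The color distance of (3)–(4) and the likelihood of (5) can be sketched as below. The exponential form and the placement of the scaling parameter σ are assumptions made for illustration (only the value σ = 0.01 is stated above); window and template are RGB patches of equal size.

```python
import numpy as np

def color_distance(window, template):
    """Illumination-invariant color distance between an observed RGB window
    and the RGB template, cf. (3)-(4).

    window, template: (H, W, 3) arrays. Each pixel is divided by the average
    intensity of the patch it belongs to, then per-pixel L2 distances are
    accumulated with a robust function (here the absolute value)."""
    w = np.asarray(window, dtype=float)
    t = np.asarray(template, dtype=float)
    w_norm = w / w.mean()                                   # divide by average intensity
    t_norm = t / t.mean()
    per_pixel = np.linalg.norm(w_norm - t_norm, axis=-1)    # L2 norm per pixel, cf. (3)
    return np.abs(per_pixel).sum()                          # robust accumulation, cf. (4)

def observation_likelihood(window, template, sigma=0.01):
    """Unnormalized observation likelihood, cf. (5); the normalization term Z is
    dropped because particle weights are renormalized at every iteration."""
    return np.exp(-sigma * color_distance(window, template))
```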
C. Transition Model

Once the observation model has been defined, we need to model the transition probability that is used to generate a new set of particles given the current one. The transition probability models our knowledge of the dynamics of the feature, that is, it models our knowledge of the feature's position in the current frame given its position in the previous frame. We model the transition probability p(α_t | α_{t−1}) of each feature as a mixture of Gaussians. The first few components model the feature's dynamics as a mixture of Gaussians around the previous position α_{t−1}. The last few components of the Gaussian mixture ignore the information about the position in the previous frame and model the static prior p(α). These last components are essentially used to recover the tracking by creating particles at positions with high priors, such as the position of the facial points in the expressionless face. More specifically

    p(α_t | α_{t−1}) = β Σ_{i=1..N1} γ_i N(α_t; α_{t−1} + m_i, Σ_i) + (1 − β) Σ_{i=N1+1..N2} γ_i N(α_t; m_i, Σ_i)    (6)

where the coefficient β is set to 95%. This means that 95% of the samples are generated based on the feature's dynamics. The number of Gaussians to be used for each feature is a design choice, which depends on the degrees of freedom of each facial feature. In our implementation, we used a very simple model with 1 or 2 Gaussians for the dynamic components (i.e., 1 ≤ N1 ≤ 2) and 1 to 3 Gaussians (i.e., 2 ≤ N2 ≤ 5) for the static components. To obtain valid static components, we first need to compensate for head motion using the global transformation T. Then, we need to compensate for physiognomic variability. Namely, different people have different faces, and the facial features are not located at exactly the same position in each face. We handle this by translating the mean of each Gaussian component of the second term of (6) by a vector estimated based on the location of the facial feature in the first frame (neutral expression frame) of the input image sequence. The means and the variances of the components of the Gaussians are estimated using the EM algorithm on a semi-automatically annotated training dataset containing the coordinates of the facial features under consideration. This dataset contains images of two persons (other than the 19 persons whose images are used to test the performance of the system as a whole) showing various facial expressions. The parameters of the transition model of each facial feature are estimated independently of the transition models of the other facial features.
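A sampler consistent with the Gaussian-mixture transition model of (6) might look as follows; β = 0.95 as stated, while the component spreads and the static-prior means are illustrative placeholders rather than values estimated with EM.

```python
import numpy as np

def sample_transition(prev_pos, rng,
                      beta=0.95,
                      dyn_sigmas=(2.0, 6.0),           # spreads of the dynamic components
                      static_means=((120.0, 80.0),),   # e.g. neutral-face location(s)
                      static_sigma=10.0):
    """Draw a candidate position for the current frame from a Gaussian mixture:
    with probability beta around the previous position (feature dynamics),
    otherwise around static prior positions (used to recover lost tracks)."""
    prev_pos = np.asarray(prev_pos, dtype=float)
    if rng.random() < beta:
        sigma = rng.choice(dyn_sigmas)                 # pick one dynamic component
        return rng.normal(prev_pos, sigma)
    mean = np.asarray(static_means[rng.integers(len(static_means))], dtype=float)
    return rng.normal(mean, static_sigma)

rng = np.random.default_rng(0)
candidate = sample_transition(np.array([118.0, 83.0]), rng)
```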
D. Tracking Multiple Facial Points

The application of auxiliary particle filtering for tracking the position of each facial point results in a set of particles and their corresponding weights {(s_k^t, π_k^t)} in each frame of the sequence. This set is a representation of the posterior p(α_t | Y_t). An estimate of the position of the facial point is then given by (1). Typical results of this algorithm are illustrated in Figs. 4 and 5. Finally, let us note that the computational complexity of the above algorithm is linear with respect to the number of particles and to the number of facial points. The main computational burdens of the algorithm are the evaluation of the likelihood p(y_t | μ_k^t) of the particles and the calculation of the color distance in (4). In our experiments, we used 100 particles for each of the 15 points which, for our Matlab code, resulted in the processing of 1 frame per 15 s on a 2.5-GHz Pentium. We expect that a careful C/C++ implementation can achieve near-real-time performance.

III. MID-LEVEL PARAMETRIC REPRESENTATION

Contractions of facial muscles alter the shape and location of the facial components (eyebrows, eyes, mouth, chin). Some of these changes in facial expression are observable from the changes in the position of the tracked points. To classify the tracked changes in terms of AUs, these changes are transformed first into a set of mid-level parameters. We have defined two mid-level parameters in total: up/down(P) and inc/dec(PP').

1) Parameter up/down(P) = y(P, t1) − y(P, t), where t1 stands for the 1st frame and t for the current frame, describes upward and downward movements of point P. If up/down(P) > ε, point P moves upwards. If up/down(P) < −ε, point P moves downwards. The value of y is the y-coordinate of point P, and the value of ε is 1 pixel.

2) Parameter inc/dec(PP') = PP'(t) − PP'(t1), where t1 stands for the 1st frame and t for the current frame, describes the increase or decrease of the distance between points P and P'. If inc/dec(PP') > ε, distance PP' increases. If inc/dec(PP') < −ε, distance PP' decreases. Distance PP' is calculated as the Euclidean distance between points P and P'.

Originally, in the first prototype of our profile-face-based AU detector [34], we used another parameter as well. The parameter in question, in/out(P), describes inward and outward movements of point P. For a left-profile view of the face, a positive value of in/out(P) describes an inward movement of point P. On the other hand, for a right-profile view of the face, a positive value of in/out(P) describes an outward movement of point P. Thus, this parameter depends upon the facial view depicted in the input image, which is the main reason why we chose not to use this parameter in the current version of our profile-face-based AU detector. In the current version of the system we use inc/dec(P15P) instead. We represent an outward movement of point P as inc/dec(P15P) > ε (i.e., as an increase of the distance between points P15 and P). Similarly, an inward movement of point P is represented as inc/dec(P15P) < −ε (i.e., as a decrease of the distance between points P15 and P).

Fig. 4. Results of the facial point tracking. First and second rows: frames 1 (neutral), 14 (blink, i.e., apex AU45), 75 (onset AU1+2+5), 89 (apex AU1+2+5), 131 (apex AU1+2+45), 137 (offset AU1+2+45), 148 (neutral). Third and fourth rows: frames 1 (neutral), 19 (onset AU36t+26), 23 (apex AU45, onset AU36t+26), 38 (apex AU36t+26), 76 (offset AU36t+26), 159 (apex AU4, onset AU17+24), 194 (apex AU4+17+24), 237 (offset AU4+17+24).

As explained in Section II, we use point P15 (Fig. 2) as the referential point because it is a stable facial point.

As can be seen from these definitions, the mid-level parameters are calculated for various points for each input frame by comparing the position of the points in the current frame with that of the relevant points in the first (neutral expression) frame. Before these calculations can be carried out, all rigid head motions in the input image sequence must be eliminated. Otherwise we would not be certain whether the value of a given parameter had changed due to the movement of the relevant points or due to a rigid head movement. As explained in Section II, to handle in-image-plane rotations and variations in scale of the observed face profile, we register each frame of the input image sequence using a global affine transformation that we estimate for each frame based on the tracked locations of P1, P4, and P15. The feature parameters are then calculated for various points tracked in each frame of the registered sequence. As can be seen in Fig. 6, the values of the mid-level parameters change as a function of time and, thus, can be used to measure temporal dynamics of AUs.
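Under the sign conventions used above (image y-coordinates grow downward, ε = 1 pixel), the two mid-level parameters can be computed per frame as in this sketch; the point coordinates are illustrative.

```python
import numpy as np

EPS = 1.0  # pixels

def up_down(point_t1, point_t):
    """up/down(P): positive when P has moved upward with respect to frame 1
    (image y-coordinates increase downward)."""
    return point_t1[1] - point_t[1]

def inc_dec(p_t1, q_t1, p_t, q_t):
    """inc/dec(PP'): positive when the Euclidean distance PP' has increased
    with respect to frame 1."""
    d_t1 = np.hypot(p_t1[0] - q_t1[0], p_t1[1] - q_t1[1])
    d_t = np.hypot(p_t[0] - q_t[0], p_t[1] - q_t[1])
    return d_t - d_t1

# Example: P7 (mouth corner) moved up and away from P5 between frame 1 and frame t.
p7_t1, p7_t = (150.0, 210.0), (152.0, 203.0)
p5_t1, p5_t = (120.0, 215.0), (118.0, 214.0)
moved_up = up_down(p7_t1, p7_t) > EPS                        # True
mouth_lengthened = inc_dec(p5_t1, p7_t1, p5_t, p7_t) > EPS   # True
```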

Fig. 5. Results of the facial point tracking. First row: frames 1 (neutral), 48 (onset AU29), 59 (apex AU29), 72 (offset AU29). Second row: frames 1 (neutral), 17 (onset AU44+9+10+20+25), 25 (apex AU44+9+10+20+25), 66 (apex AU45, offset AU44+9+10+20+25). Third row: frames 1 (neutral), 12 (onset AU36b+26), 43 (apex AU36b+26), 62 (offset AU36b+26). Fourth row: frames 1 (neutral), 25 (onset AU12), 30 (onset AU6+12), 55 (apex AU6+12+25+45).

IV. RECOGNITION OF AUS AND THEIR TEMPORAL DYNAMICS

We transform the calculated mid-level feature parameters into a set of AUs describing the facial expression(s) captured in the input video. We use a set of temporal rules and a fast direct chaining inference procedure to encode 27 AUs occurring alone or in combination in an input face-profile image sequence. To minimize the effects of noise and inaccuracies in facial point tracking and to enable the recognition of the temporal dynamics of shown AUs, we consider the temporal consistency of the mid-level parameters.

We divide each facial action into three temporal segments: the onset (beginning), apex (peak), and offset (ending). We define each temporal rule for AU recognition in a unique way according to the relevant FACS rule and using the mid-level parameters explained in Section III. Tables I–III provide the list of the utilized rules. Fig. 6 illustrates the meaning of these rules for the case of AU1, AU2, and AU12.

Fig. 6. Values of four mid-level feature parameters (in left-to-right order): up/down(P12) and up/down(P11) computed for 163 frames of the AU1+2+5 face-profile video depicted in the first two rows of Fig. 4, and up/down(P7) and inc/dec(P5P7) computed for 92 frames of the AU6+12+25 face-profile video depicted in the fourth row of Fig. 5.

In Fig. 6, the horizontal axis represents the time dimension (i.e., the frame number) and the vertical axis represents the values that the mid-level feature parameters take. As implicitly suggested by the two left-hand-side graphs of Fig. 6, P12 (respectively P11) should move upward and it should be above its neutral-expression location to label a frame as an "AU1 (respectively AU2)² onset". The upward motion should terminate, resulting in a (relatively) stable temporal location of P12 (P11), for a frame to be labeled as "AU1 (AU2) apex". Eventually, P12 (P11) should move downward toward its neutral-expression location to label a frame as an "AU1 (AU2) offset". Similarly, as implicitly suggested by the two right-hand-side graphs of Fig. 6, P7 should move upward, above its neutral-expression location, and the distance between points P5 and P7 should increase, exceeding its neutral-expression length, in order to label a frame as an "AU12³ onset". In order to label a frame as "AU12 apex", the increase of the values of the relevant mid-level parameters should terminate. Once the values of these mid-level parameters begin to decrease, a frame can be labeled as "AU12 offset". Note that the two right-hand-side graphs of Fig. 6 show two distinct peaks in the increase of the pertinent mid-level parameters. As shown by Schmidt and Cohn [28], this is typical for spontaneous smiles and in contrast to posed smiles.

²Since the upward motion of the inner corner of the eyebrow is the principal cue for the activation of AU1, the upward movement of the fiducial point P12 is used as the criterion for detecting the onset of the AU1 activation. Reversal of this motion is used to detect the offset of this facial expression. Similarly, the upward movement of the outer corner of the eyebrow (i.e., point P11) is used as the criterion for detecting the onset of the AU2 activation.

³The upward, oblique motion of the mouth corner is the principal cue for the activation of AU12. Hence, the upward movement of the fiducial point P7 and the increase of the distance between points P5 and P7, typical for oblique (AU12) rather than sharp (AU13) upward movement of the mouth corner, are used as the criteria for detecting the onset of the AU12 activation. Reversal of these motions is used to detect the offset of this facial expression.

Generally, for each and every AU, it must be possible to detect a temporal segment (an onset, apex, or offset) continuously over at least five consecutive frames for the facial action in question to be scored. We determined this temporal duration empirically, based on a video frame rate of 24 frames/s (i.e., five frames have a duration of less than 1/4 of a second) and based on research findings that suggest that temporal changes in neuromuscular facial activity last from 1/4 of a second (e.g., a blink) to several minutes (e.g., a jaw clench) [15].
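The labeling pattern described above for AU1 (upward motion above the neutral location labels onset, a stable elevated position labels apex, and downward motion back toward the neutral location labels offset) can be sketched as a per-frame pass over an up/down(P) time series. This is an illustrative simplification of a single rule, not the rule base of Tables I–III, and it ignores the five-consecutive-frame requirement.

```python
def label_temporal_segments(updown_series, eps=1.0):
    """Label each frame of an up/down(P) time series (e.g. up/down(P12) for AU1)
    as 'onset', 'apex', 'offset' or None.

    onset : the point is above its neutral location and still rising,
    apex  : above the neutral location and (relatively) stable,
    offset: above the neutral location but moving back down toward it."""
    labels = [None]  # frame 1 is the neutral-expression reference
    for t in range(1, len(updown_series)):
        value = updown_series[t]
        delta = updown_series[t] - updown_series[t - 1]
        if value <= eps:
            labels.append(None)        # at (or below) the neutral location
        elif delta > eps:
            labels.append("onset")
        elif delta < -eps:
            labels.append("offset")
        else:
            labels.append("apex")      # elevated and stable
    return labels

series = [0, 0.5, 2, 4, 6, 6.2, 6.1, 4, 2, 0.5, 0]
print(label_temporal_segments(series))
# [None, None, 'onset', 'onset', 'onset', 'apex', 'apex', 'offset', 'offset', None, None]
```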
The employed fast direct chaining inference procedure takes advantage of both a relational representation of the knowledge and a depth-first search to find as many conclusions as possible within a single "pass" through the knowledge base [49]. The use of the R-list achieves a relational representation of the knowledge. The R-list is a 4-tuple list, where the first two columns identify the conclusion of a certain rule that forms the premise of another rule identified in the next two columns of the R-list. For example, the relation between rules 13 and 21 (Table II) is represented as (21, 1, 13, 2), which means that the 1st conclusion of rule 21 forms the 2nd premise of rule 13. The term direct indicates that, as the inference process is executing, it creates the proper chain of reasoning.

A recursive process starts with the first rule of the knowledge base. Then, it searches the R-list for a link between the fired rule and the rule that the process will try to fire in the next loop. If such a relation does not exist, the procedure tries to fire the rule that in the knowledge base comes after the rule last fired.

Inaccuracies in facial point tracking and occurrences of nonprototypic facial activity may result in frames and temporal segments that are unlabeled (i.e., neither the onset, nor the apex, nor the offset) and in frames and temporal segments that are labeled incorrectly. The latter may arise, for example, when an apex frame or an apex temporal segment of an AU is detected either between two onset segments or between two offset segments. To handle such situations, we employ a memory-based process that takes into account the dynamics of facial expressions. More specifically, we examine the labels of both the previous and the next frame/temporal segment and re-label the current frame/temporal segment according to the rule-based system summarized in Table IV.

For instance, any unlabeled temporal segment and/or any apex segment that has been detected between two onset segments is re-labeled as "onset". Finally, an AU should be recognized, in general, only when the full temporal pattern of that AU is observed (e.g., see Fig. 6 for the cases of AU1, AU2, and AU12). However, in order to deal with fast transitions between onsets and offsets, we score AUs even if the relevant apexes are missing.
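The memory-based re-labeling of Table IV can be illustrated with the rule quoted above; the symmetric rule for segments caught between two offset segments is added here by analogy and is an assumption, since the full table is not reproduced in the text.

```python
def relabel_segments(segments):
    """Resolve conflicts in a list of temporal-segment labels
    ('onset', 'apex', 'offset' or None) by looking at both neighbours."""
    fixed = list(segments)
    for i in range(1, len(fixed) - 1):
        prev_label, next_label = fixed[i - 1], fixed[i + 1]
        if fixed[i] in (None, "apex") and prev_label == next_label == "onset":
            fixed[i] = "onset"       # rule quoted in the text
        elif fixed[i] in (None, "apex") and prev_label == next_label == "offset":
            fixed[i] = "offset"      # assumed symmetric rule
    return fixed

print(relabel_segments(["onset", None, "onset", "apex", "offset", "apex", "offset"]))
# ['onset', 'onset', 'onset', 'apex', 'offset', 'offset', 'offset']
```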
V. AUTOMATIC SEGMENTATION OF AN INPUT VIDEO SEQUENCE

Virtually all the existing AU detectors only perform well on isolated or pre-segmented facial expression image sequences showing a single temporal activation pattern of one or more AUs. In reality, such segmentation is not available and, hence, there is a need to find an automatic way of segmenting face image sequences into the different facial expressions pictured. A way to achieve this has been proposed by Otsuka and Ohya [50] and Cohen et al. [51]. To cope with cases where two facial expressions of emotion are displayed contiguously, Otsuka and Ohya applied a heuristic approach and modified the employed Hidden Markov model (HMM) computation such that, when the peak of a facial motion is detected, the current emotional expression is assumed to start from the previous frame with minimal facial motion.

TABLE I
RULES FOR RECOGNIZING AU1, AU2, AU4–AU7, AU9, AU10, AU12, AU13, AU15, AND AU16 FROM A FACE-PROFILE IMAGE SEQUENCE. LEGEND: FOR NOTATIONAL SIMPLICITY, (x < x') STANDS FOR (x < x' − ε), (x = x') FOR (|x − x'| ≤ ε), AND (x > x') FOR (x > x' + ε); t1 STANDS FOR THE FIRST FRAME, t FOR THE CURRENT FRAME, AND t − 1 FOR THE PREVIOUS FRAME. THE VALUE ASSIGNED TO ε IS 1 pixel. THRESHOLD T1 = 1/2 P13P14 DISTINGUISHES BETWEEN THE ACTIVATION OF AU6, AU7, AU41 AND THAT OF AU44, AU43, AND AU45. THE VALUE OF T1 HAS BEEN DECIDED BASED UPON THE THRESHOLD DESCRIPTION PROVIDED BY THE RELEVANT FACS RULES

Similarly, Cohen et al. proposed an HMM-based method for recognition of six basic emotions, which assumes that the transitions between emotions pass through the neutral facial expression. Loosely speaking, we adopted a similar approach. To automatically segment an arbitrarily long video sequence into the segments that correspond to expressive and expressionless facial displays, we use a sequential facial expression model. A display of expressive facial behavior in video corresponds to a temporal sequence of facial movement that we represent as a sequence of temporal patterns (onset-apex-offset) of one or more AUs. Since the presence of facial activity determines the shown facial expression, its absence can be used to delimit the transition between different expressions. The term "neutral facial expression" ("expressionless face") is usually used to designate the absence of facial activity. Thus, to solve the segmentation problem, we use a neutral-expressive-neutral sequential facial expression model, where an "expressive" segment contains temporal patterns (onset-apex-offset) of one or more AUs encoded by our AU recognizer.

TABLE II
RULES FOR RECOGNIZING AU17, AU18, AU20, AND AU23–AU29 FROM A FACE-PROFILE IMAGE SEQUENCE. LEGEND: FOR NOTATIONAL SIMPLICITY, (x < x') STANDS FOR (x < x' − ε), (x = x') FOR (|x − x'| ≤ ε), AND (x > x') FOR (x > x' + ε); t1 STANDS FOR THE FIRST FRAME, t FOR THE CURRENT FRAME, AND t − 1 FOR THE PREVIOUS FRAME. THE VALUE ASSIGNED TO ε IS 1 pixel. THRESHOLD T2 = 1/2 P8P10 DISTINGUISHES BETWEEN THE ACTIVATION OF AU26 AND THAT OF AU27. THE VALUE OF T2 HAS BEEN DECIDED BASED UPON EARLIER STUDIES ON AUTOMATIC ANALYSIS OF FACIAL EXPRESSIONS FROM STATIC-FACE IMAGES [25]

Since we assume that the input to our system consists of facial expression sequences that always start with a neutral facial expression, the neutral-expressive-neutral sequential facial expression model suffices. The model will also be applicable to the data contained in the Cohn–Kanade Face Database [54], which is one of the most commonly used data sets in the research on automatic facial expression analysis. This is because all facial expression sequences in the Cohn–Kanade Face Database start with a neutral expression. However, in cases where no constraints are posed on input facial expression sequences, the proposed model will not suffice. Also, if one wants to segment input face video in terms of specific facial displays such as the emotional facial expressions, the proposed model will suffice only if each of these specific states begins and ends with a neutral facial expression. This is because the model has been developed to differentiate facial activity (presence of AUs) from inactivity; it delimits the transition between different facial displays based on the absence of facial activity rather than the difference in facial activity.

TABLE III
RULES FOR RECOGNIZING AU36, AU41, AU43, AU44, AND AU45 FROM A FACE-PROFILE IMAGE SEQUENCE. LEGEND: FOR NOTATIONAL SIMPLICITY, (x < x') STANDS FOR (x < x' − ε), (x = x') FOR (|x − x'| ≤ ε), AND (x > x') FOR (x > x' + ε); t1 STANDS FOR THE 1ST FRAME, t FOR THE CURRENT FRAME, AND t − 1 FOR THE PREVIOUS FRAME. THE VALUE ASSIGNED TO ε IS 1 pixel. FOR T1 SEE TABLE I

TABLE IV
RULES FOR RESOLVING TEMPORAL CONFLICTS AND UNCERTAINTIES. EXCEPT FOR RULE 3, THE RULES ARE UTILIZED IN BOTH CASES: IF SINGLE FRAMES ARE
UNLABELED OR LABELED INCORRECTLY AND IF TEMPORAL SEGMENTS (A SEQUENCE OF AT LEAST FIVE CONSECUTIVE FRAMES) ARE UNLABELED OR LABELED
INCORRECTLY. RULE 3 IS UTILIZED FOR TEMPORAL SEGMENTS ONLY. RULE 4 HAS A MORE COMPLEX FORM FOR THE CASE OF TEMPORAL SEGMENTS. NAMELY,
ONLY IF A SEQUENCE ONSET-APEX-UNLABELED-APEX-OFFSET IS ENCOUNTERED, THE UNLABELED TEMPORAL SEGMENT WILL BE RE-LABELED AS “APEX”

Thus, in the case of a neutral → sad → smile → neutral facial display, the proposed model will handle the "sad" and "smile" segments as a single expressive segment rather than two distinct facial expressions. To achieve segmentation into specific facial displays and to handle cases where no constraints are posed on input facial expression sequences, an extended variable-neutral-expressive-variable sequential model, where a "variable" segment contains either an expressive or a neutral facial appearance, should be used. However, appropriate handling of these "variable" segments and the associated problems, including the registration of the input video sequence and the cancellation of noise, is not an easy task.
VI. EXPERIMENTAL EVALUATION

In spite of repeated calls for the need of a comprehensive, readily accessible reference set of face images that could provide a basis for benchmarks for all the different efforts in research on machine analysis of facial expressions, no such database has yet been created that is shared by all diverse facial-expression-research communities [9], [16], [29]. In general, only isolated pieces of such a facial database exist. An example is the unpublished database of Ekman–Hager Facial Action Exemplars [52]. It has been used by several research groups (e.g., [20], [21], [24]) to train and test their methods for AU detection from frontal-view facial expression sequences. The facial expression image databases that have been made publicly available but are still not used by all diverse facial-expression-research groups are the JAFFE database [53] and the Cohn–Kanade AU-coded face image database [54].

Fig. 7. Examples of MMI-Face-Database images. First row: static frontal-view images. Second row: apex frames of dual-view video sequences.

None of these existing databases contains images of faces in profile view and none contains images of all possible single-AU activations. Also, the metadata (labels) associated with each database object usually do not identify the temporal segments (onset, apex, offset) of AUs and emotional facial displays shown in the face video in question. Finally, these databases are not easily accessible and searchable. Once permission to use one of these databases has been issued, large, unstructured files of material are sent. As an attempt to address these issues, we have created a novel facial-expression-image database, which we call the MMI Face Database [55].

The MMI Face Database has been developed to address all the issues mentioned above. It contains more than 1500 samples of both static images and image sequences of faces in frontal and in profile view displaying various facial expressions of emotion, single AU activation, and multiple AU activation. It is publicly available and it has been developed as a web-based direct-manipulation application, allowing easy access and easy search of the available images. All data samples stored in the database have been acquired in the following way.

• Sensing: The static facial-expression images are all true color (24-bit) images which, when digitized, measure 720 × 576 pixels. There are approximately 600 frontal-view images and 140 dual-view images (i.e., combining frontal and profile view of the face, recorded using a mirror) of facial expressions. All video sequences have been recorded at a rate of 24 frames/s using a standard PAL camera. There are approximately 30 profile-view and 750 dual-view facial-expression sequences. The sequences are of variable length, lasting between 40 and 520 frames. Examples of recordings stored in the MMI Face Database are illustrated in Figs. 4, 5, and 7.

• Subjects: Our database includes 52 different faces of students and research staff members of both sexes (44% female), ranging in age from 19 to 62, having either a European, Asian, or South American ethnic background.

• Samples: The subjects were asked to display expressions that included either a single AU or a prototypic combination of AUs (such as in expressions of emotion). They were instructed by an expert (a FACS coder) on how to display the required facial expressions, and they were asked to include a short neutral state at the beginning and at the end of each expression. The subjects were asked to display the required expressions while minimizing out-of-plane head motions.

• Metadata: Two experts (FACS coders) were asked to depict the AUs displayed in the images constituting the MMI Face Database [55]. In the case of facial-expression video sequences, they were also asked to depict the temporal segments of displayed AUs. When in doubt, decisions were made by consensus.

In order to test the AU recognition method described in the previous sections, we used 26 profile-view and 70 dual-view video sequences of the MMI Face Database (19 different subjects in total). In the case of dual-view video sequences, we used only the profile view of the face as the actual data. The metadata associated with these 96 image sequences represent the ground-truth with which we compared the judgments generated by our method. According to the neutral-expressive-neutral sequential facial expression model described in Section V, the sequences were first segmented into the different facial expressions pictured. Then, we initialized nine profile-contour facial points (P1–P6 and P8–P10, Fig. 2) as the extremities of the profile contour, as proposed in [25]. The other six facial points (P7 and P11–P15, Fig. 2) were manually initialized in the first frame of each of the 96 test sequences. (Note, however, that the method proposed in [36] can be easily trained to localize points P7 and P11–P15 automatically.) The accuracy of the method was measured with respect to the misclassification rate of each "expressive" segment of the input sequence, not with respect to each frame.

The results are summarized in Table V. The first column of Table V lists all the different AUs occurring in the 96 test image sequences according to the ground-truth. The second column identifies the total number of occurrences of each AU in the test data set according to the ground-truth. Correct means that the AUs detected by our method were identical to the AUs indicated by the ground-truth. Partially correct denotes either that some, but not all, of the AUs indicated by the ground-truth were not recognized by our method (Missing AUs), or that some AUs that were not indicated by the ground-truth were recognized in addition to those that were (Extra AUs). Incorrect means that none of the AUs indicated by the ground-truth were recognized by the method. The overall recognition rate of the system has been calculated with respect to both the number of input AUs indicated by the ground-truth and the number of input samples (i.e., the number of "expressive" segments in the input video sequences). The average recognition rate of the system with respect to the number of AUs has been calculated as the ratio between the number of correctly recognized AUs and the number of input AUs. The average recognition rate of the system with respect to the number of input samples has been calculated as the ratio between the number of correctly recognized input samples and the total of 119 input samples.

TABLE V
METHOD’S AU RECOGNITION PERFORMANCE FOR 96 TEST FACE-PROFILE IMAGE SEQUENCES. LEGEND: THE AVERAGE RECOGNITION RATE OF THE SYSTEM
WITH RESPECT TO THE NUMBER OF INPUT AUS: CORRECT/NR. OF OCCURRENCES. THE AVERAGE RECOGNITION RATE OF THE SYSTEM WITH
RESPECT TO THE NUMBER OF INPUT SAMPLES (i.e., TO THE NUMBER OF “EXPRESSIVE” SEGMENTS OF INPUT VIDEO SEQUENCES): NR.
OF CORRECTLY RECOGNIZED INPUT SAMPLES/THE TOTAL OF 119 INPUT SAMPLES

The average recognition rate of the system with respect to the number of input samples has been calculated as the ratio between the number of correctly recognized input samples and the total of 119 input samples. We achieved an average recognition rate of 93.6% input AUs-wise and an average recognition rate of 86.6% input samples-wise (Table V).

As far as misidentifications produced by our method are concerned, most of them arose from confusion between similar AUs (AU41 and AU43, AU23 and AU24) and from omission of very fast blinks (AU45 having a duration of less than five frames in either onset or offset). Both AU41 and AU43 cause the upper eyelid to drop down and narrow the eye opening. Only the height of the eye opening distinguishes AU41 from AU43, causing misidentification of AU41 in the case where the observed subject has long eyelashes or an eye opening that is naturally narrow. Since both AU23 and AU24 tighten the lips and reduce the height of the lips (vertical direction), only the length of the lips (horizontal direction) distinguishes AU24 from AU23, causing misidentification of any AU23 that is accompanied by an unintentional, small, out-of-plane head motion that makes the mouth appear shorter. Note that AU23 and AU24 are also often confused by human FACS coders [15] and by other automated AU analyzers (e.g., [22]). In addition, note that the temporal pattern of feature motion in AU23 activation is very similar to the one occurring in AU24 activation. Hence, the distinction between these two AUs may be more amenable to appearance-based analysis than to feature motion analysis.

In addition to the misidentifications listed above, the mistaken identifications of AU26 merit an explanation as well. In two cases, AU26 was present but the slightly parted teeth in a closed mouth remained undetected by human observers. In these cases, our method coded the input samples correctly, unlike the human observers.
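The confusions discussed above stem from the fact that each confusable pair is separated by a single geometric measurement. The toy Python sketch below illustrates such a discrimination step; the measurement names, thresholds, and the assumed direction of the lip-length change are illustrative guesses, not the rules of Table II.

```python
# Toy illustration (not the authors' Table II rules): each confusable AU pair is
# separated by a single geometric measurement derived from the tracked facial points.

def eyelid_au(eye_opening, neutral_opening, narrow=0.6, closed=0.2):
    """AU41 vs. AU43: both narrow the eye; only the eye-opening height separates them."""
    ratio = eye_opening / neutral_opening
    if ratio < closed:
        return "AU43"      # eye (nearly) closed
    if ratio < narrow:
        return "AU41"      # upper lid drooped, eye still open
    return None

def lip_au(lip_height, lip_length, neutral_height, neutral_length,
           thin=0.85, shorten=0.9):
    """AU23 vs. AU24: both thin the lips; only the mouth length separates them.
    The direction and size of the length change used here are illustrative guesses."""
    if lip_height / neutral_height >= thin:
        return None        # lips not visibly tightened or pressed
    if lip_length / neutral_length < shorten:
        return "AU24"      # assumed: a shortened mouth is read as pressed lips
    return "AU23"

# A small out-of-plane head rotation that foreshortens the mouth can push lip_length
# below the threshold and flip an AU23 decision to AU24, the confusion described above.
print(eyelid_au(3.0, 8.0), lip_au(4.0, 38.0, 5.5, 42.0))
```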
Fig. 8. Temporal segments of the AUs indicated by the ground truth (dashed line) and those detected by the method (dotted line). (a) AU2 activation in 163 frames of an "expressive" segment of an AU1+2+5 video sequence (first two rows of Fig. 4). (b) AU12 activation in 92 frames of an "expressive" segment of an AU6+12+25 video sequence (the fourth row of Fig. 5). (c) AU27 activation in 85 frames of an "expressive" segment of an AU27 video sequence (not illustrated).
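The comparison visualized in Fig. 8 amounts to measuring how far the boundaries of the detected temporal segments lie from the annotated ones. A minimal sketch of such a measurement follows; the frame rate of roughly 24 frames per second is an assumption inferred from the frame-to-seconds figures quoted in the text below, and the segment representation is hypothetical.

```python
# Hypothetical sketch: quantify the onset delay between an annotated (ground-truth)
# temporal segment and the segment detected by the method, in frames and in seconds.

FPS = 24.0  # assumed frame rate; consistent with "three frames ~ 1/8 of a second" below

def onset_delay(gt_segment, detected_segment, fps=FPS):
    """Segments are (onset_frame, apex_frame, offset_frame) triples for one AU."""
    delay_frames = detected_segment[0] - gt_segment[0]   # positive = detected late
    return delay_frames, delay_frames / fps

# Example: ground-truth onset at frame 21, detected onset at frame 24 -> 3 frames, 0.125 s.
print(onset_delay((21, 48, 77), (24, 50, 76)))
```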
As can be seen from Fig. 8, the temporal segments of the AUs indicated by the ground-truth varied slightly from those detected by our method. In Fig. 8, the solid line represents the values calculated by the method for the relevant mid-level parameters, plotted against the frame number on the horizontal axis. The dotted line represents the temporal segments of the AUs calculated by the method. The dashed line represents an abstraction of the temporal segments of the AUs indicated by the ground-truth. For most AUs, the boundaries of the temporal segments were detected either at the same moment or a little later than prescribed by the ground-truth [Fig. 8(a)]. The measured delays take up to three frames on average, that is, up to 1/8 of a second. However, in the case of AUs whose activation becomes apparent from the movement of the mouth corner (i.e., AU12, AU13, AU15, and AU20), the temporal segments were almost always detected later than indicated by the ground-truth. The measured delays have an average duration of three to six frames, that is, up to 1/4 of a second [Fig. 8(b)]. The reason for these delays lies in the temporal rules used for recognition of AU activation. It seems that human observers detect activation of the AUs in question not only based on the presence of a certain movement (e.g., an upward movement of the mouth corner in the case of AU12) but also based on the appearance of the facial region around the mouth corner. Since appearance-based analysis is not performed by the system, only the movement of the mouth corner, which is usually detected later than the actual occurrence of the movement (due to thresholding), indicates the presence of the AUs in question, causing a delayed detection of these AUs. In addition, it is interesting to note that, in cases of spontaneous smiles, the human observers indicated the presence of multiple apexes of AU12 but, in contrast to the analysis performed by our system, did not indicate the presence of multiple full temporal patterns (onset-apex-offset) of AU12 [Fig. 8(b)]. However, whether this difference is just a matter of a human coder blindly applying an accepted coding scheme according to which such a "dampened" smile is represented as a multiple-apex AU12 [28], or a matter of a genuine insensitivity of the human eye to the subtle offsets of AU12 occurring in between the apexes of AU12, remains an interesting research question.

Finally, upon a close inspection of the temporal rule used to recognize AU27 activation (rule 20, Table II), one may conclude that the onset of AU27 will always be detected later than indicated by the ground-truth. Namely, since both AU26 and AU27 pull down the lower jaw, only the extent of that pull distinguishes AU27 from AU26, causing misidentifications in the onset of AU27, that is, a delayed detection of the onset of AU27. This is consistent with experimental data that show a correlation between the extent of facial motion involved in a facial expression and the delay in the recognition of that expression [8], [56]: the larger the motion (and, in turn, the deformation in facial expression), the longer the response time. To handle this, any "onset AU26" segment that has been detected before the "onset AU27" segment is re-labeled as "onset AU27". In turn, the onset of AU27 is detected without delays [Fig. 8(c)].
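A minimal sketch of this re-labeling step is given below; the per-frame label representation and the helper name are illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical sketch of the re-labeling described above: any "onset AU26" frames that
# immediately precede a detected "onset AU27" segment are re-labeled as "onset AU27",
# so that the onset of AU27 is not reported late.

def relabel_au26_onset(labels):
    """labels: per-frame label strings, e.g. 'onset AU26', 'onset AU27', 'apex AU27'."""
    out = list(labels)
    for t, lab in enumerate(out):
        if lab == "onset AU27":
            k = t - 1
            while k >= 0 and out[k] == "onset AU26":   # walk back over the AU26 onset run
                out[k] = "onset AU27"
                k -= 1
            break                                      # only the run leading into AU27 is touched
    return out

frames = ["neutral", "onset AU26", "onset AU26", "onset AU27", "apex AU27"]
print(relabel_au26_onset(frames))
# ['neutral', 'onset AU27', 'onset AU27', 'onset AU27', 'apex AU27']
```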
VII. CONCLUSIONS

Automating the analysis of facial signals, especially rapid facial signals (i.e., AUs), is important to advance studies on human emotion and nonverbal communication, to design multimodal human-machine interfaces, and to boost numerous applications in fields as diverse as security, medicine, and education. In this paper, we presented a novel method for AU detection based upon changes in the position of the facial points tracked in a video of a near profile view of the face. The significant aspects of this contribution are the following.

• The presented approach to automatic AU recognition extends the state of the art in automatic AU detection from face image sequences in several ways, including the facial view (profile), the temporal segments of AUs (onset, apex, offset), the number of AUs handled (27 in total), and the variety of AUs handled (e.g., AU29, AU36). To wit, the automated systems for AU detection from face video that have been reported so far do not deal with the profile view of the face, cannot handle temporal dynamics of AUs, cannot detect out-of-plane movements such as thrusting the jaw forward (AU29), and, at best, can detect 16 to 18 AUs (out of 44 AUs in total).

• This paper provides a basic understanding of how to achieve automatic detection of AUs and their temporal segments in a face-profile image sequence. Further research on facial expression symmetry, spontaneous vs. posed facial expressions, and facial expression recognition from multiple facial views can be based upon it.

Based upon the validation study presented in Section VI, it can be concluded that the proposed method exhibits an acceptable level of expertise. The achieved results are similar to those reported for other automated FACS coders of face video. Compared to the AFA system [24], our method achieves an average recognition rate of 86.6% for encoding 27 AU codes and their combinations in 119 test samples, while the AFA system achieves an average recognition rate of 87.9% for encoding 16 AUs and their combinations in 113 test samples. In comparison to the system proposed recently by Bartlett et al. [22], our method achieves an average recognition rate of 93.6% AU-wise for encoding 27 AUs and their combinations, while their system achieves an average recognition rate of 94.5% AU-wise for encoding 18 AUs and their combinations.

Apart from the profile view, the number of AUs, the variety of AUs, and the temporal dynamics handled, our method has also improved other aspects of automated AU detection compared to previously reported systems. In contrast to earlier approaches to automated AU detection, our system facilitates automatic segmentation of input image sequences into the expressive and expressionless facial behavior pictured. Also, the performance of the proposed method is invariant to occlusions like glasses and facial hair as long as these do not entirely occlude the facial fiducial points (e.g., P10 in the case of a long beard). Finally, due to the usage of the color-based observation model (Section II-B), the method performs well independently of changes in the illumination intensity.

However, the method cannot recognize the full range of facial behavior (i.e., all 44 AUs defined in FACS); it detects 27 AUs occurring alone or in combination in a near profile-view face image sequence. Although it has been reported that feature-based methods are usually outperformed by holistic template-based methods using Gabor wavelets, Independent Component Analysis, and Eigenfaces [20], [53], the comparison given above indicates that our feature-based method performs just as well as the best template-based method proposed to date (i.e., [22]). We believe, however, that further research efforts toward combining both approaches are necessary if the full range of human facial behavior is to be coded in an automatic way.

If we consider the state of the art in face detection and facial point localization and tracking, noisy and partial data should be expected. As remarked by Pantic et al. [9], [19], a facial expression analyzer should be able to deal with these imperfect data and to generate its conclusion so that the certainty associated with it varies with the certainty of the face and facial point localization and tracking data. To deal with inaccuracies in facial point tracking, our method employs a memory-based process that takes into account the dynamics of facial expressions (Table IV). However, our method does not calculate the output data certainty by propagating the input data certainty (i.e., the certainty of facial point tracking). Future work on this issue aims at investigating the use of measures that can express the confidence of facial point tracking and that can facilitate both more robust AU recognition and the assessment of the certainty of the performed AU recognition.

Finally, our method assumes that the input data are near profile-view face image sequences showing facial displays that always begin with a neutral state. In reality, such an assumption cannot be made; variations in the viewing angle should be expected. Also, human facial behavior is more complex, and transitions from one facial display to another do not have to involve intermediate neutral states. As a consequence, the proposed facial expression analyzer cannot deal with spontaneously occurring facial behavior. Yet, answering the question of how to parse the stream of facial and head movements not under volitional control is essential for the realization of multimodal human-machine interfaces and for advancing studies on human emotion and nonverbal communication [57]. This forms the main focus of our current and future research efforts.

ACKNOWLEDGMENT

The authors would like to thank the referees and M. J. Nieman for their helpful comments and suggestions.

REFERENCES

[1] J. Russell and J. Fernandez-Dols, The Psychology of Facial Expression. New York: Cambridge Univ. Press, 1997.
[2] A. Mehrabian, "Communication without words," Psych. Today, vol. 2, no. 4, pp. 53–56, 1968.
[3] D. Keltner and P. Ekman, "Facial expression of emotion," in Handbook of Emotions, M. Lewis and J. M. Haviland-Jones, Eds. New York: Guilford, 2000, pp. 236–249.
[4] K. Mase, "Recognition of facial expression from optical flow," IEICE Trans., vol. E74, no. 10, pp. 3474–3483, 1991.
[5] M. Black and Y. Yacoob, "Recognizing facial expressions in image sequences using local parameterized models of image motion," Comput. Vis., vol. 25, no. 1, pp. 23–48, 1997.
[6] I. Essa and A. Pentland, "Coding, analysis, interpretation and recognition of facial expressions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 757–763, Jul. 1997.
[7] M. Pantic and L. J. M. Rothkrantz, "Expert system for automatic analysis of facial expression," Image Vis. Comput. J., vol. 18, no. 11, pp. 881–905, 2000.
[8] A. M. Martinez, "Matching expression variant faces," Vis. Res., vol. 43, no. 9, pp. 1047–1060, 2003.
[9] M. Pantic and L. J. M. Rothkrantz, "Toward an affect-sensitive multimodal human-computer interaction," Proc. IEEE, vol. 91, no. 9, pp. 1370–1390, Sep. 2003.
[10] C. Darwin, The Expression of the Emotions in Man and Animals. Chicago, IL: Univ. of Chicago Press, 1965.
[11] P. Ekman, Emotions Revealed. New York: Times Books, 2003.
[12] A. Ortony and T. J. Turner, "What is basic about basic emotions?," Psych. Rev., vol. 74, pp. 315–341, 1990.
[13] K. R. Scherer and P. Ekman, Handbook of Methods in Non-Verbal Behavior Research. Cambridge, U.K.: Cambridge Univ. Press, 1982.
[14] P. Ekman and W. V. Friesen, Facial Action Coding System. Palo Alto, CA: Consulting Psychologist Press, 1978.
[15] P. Ekman, W. V. Friesen, and J. C. Hager, Facial Action Coding System. Salt Lake City, UT: A Human Face, 2002.
[16] M. Pantic and L. J. M. Rothkrantz, "Automatic analysis of facial expressions: The state of the art," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 12, pp. 1424–1445, Dec. 2000.
[17] H. Tao and T. S. Huang, "Connected vibrations: A modal analysis approach to nonrigid motion tracking," in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, 1998, pp. 735–740.
[18] S. B. Gokturk, J. Y. Bouguet, C. Tomasi, and B. Girod, "Model-based face tracking for view-independent facial expression recognition," in Proc. IEEE Int. Conf. Face and Gesture Recognition, 2002, pp. 272–278.
[19] M. Pantic, "Face for interface," in The Encyclopedia of Multimedia Technology and Networking, M. Pagani, Ed. Hershey, PA: Idea Group Reference, 2005, vol. 1, pp. 308–314.
[20] M. S. Bartlett, J. C. Hager, P. Ekman, and T. J. Sejnowski, "Measuring facial expressions by computer image analysis," Psychophysiology, vol. 36, pp. 253–263, 1999.
[21] G. Donato, M. S. Bartlett, J. C. Hager, P. Ekman, and T. J. Sejnowski, "Classifying facial actions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 10, pp. 974–989, Oct. 1999.
[22] M. S. Bartlett, G. Littlewort, C. Lainscsek, I. Fasel, and J. R. Movellan, "Machine learning methods for fully automatic recognition of facial expressions and facial actions," in Proc. IEEE Int. Conf. Systems, Man, and Cybernetics, 2004, pp. 592–597.
[23] J. F. Cohn, A. J. Zlochower, J. Lien, and T. Kanade, "Automated face analysis by feature point tracking has high concurrent validity with manual FACS coding," Psychophysiology, vol. 36, pp. 35–43, 1999.
[24] Y. Tian, T. Kanade, and J. F. Cohn, "Recognizing action units for facial expression analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 2, pp. 97–115, Feb. 2001.
[25] M. Pantic and L. Rothkrantz, "Facial action recognition for facial expression analysis from static face images," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, pp. 1449–1461, Jun. 2004.
[26] M. F. Valstar, M. Pantic, and I. Patras, "Motion history for facial action detection from face video," in Proc. IEEE Int. Conf. Systems, Man, and Cybernetics, 2004, pp. 635–640.
[27] Y. Yacoob, L. Davis, M. Black, D. Gavrila, T. Horprasert, and C. Morimoto, "Looking at people in action," in Computer Vision for Human–Machine Interaction, R. Cipolla and A. Pentland, Eds. Cambridge, U.K.: Cambridge Univ. Press, 1998, pp. 171–187.
[28] K. L. Schmidt and J. F. Cohn, "Dynamics of facial expression: Normative characteristics and individual differences," in Proc. IEEE Int. Conf. Multimedia and Expo, 2001, pp. 547–550.
[29] Human Interaction Laboratory, "Final Report to NSF of the Planning Workshop on Facial Expression Understanding," Univ. of California, San Francisco, CA, P. Ekman, T. S. Huang, T. J. Sejnowski, and J. C. Hager, Eds., 1993.
[30] M. Mendolia and R. E. Kleck, "Watching people talk about their emotions—Inferences in response to full-face vs. profile expressions," Motiv. Emotion, vol. 15, no. 4, pp. 229–242, 1991.
[31] J. C. Hager, "Asymmetry in facial muscular actions," in What the Face Reveals, P. Ekman and E. L. Rosenberg, Eds. New York: Oxford Univ. Press, 1997, pp. 58–62.
[32] S. Mitra and Y. Liu, "Local facial asymmetry for expression classification," in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, 2004, pp. 889–894.
[33] A. M. Martinez, "Recognizing imprecisely localized, partially occluded, and expression variant faces from a single sample per class," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 6, pp. 748–763, Jun. 2002.
[34] M. Pantic, I. Patras, and L. J. M. Rothkrantz, "Facial action recognition in face profile image sequences," in Proc. IEEE Int. Conf. Multimedia and Expo, 2002, pp. 37–40.
[35] M. Pantic and I. Patras, "Temporal modeling of facial actions from face profile image sequences," in Proc. IEEE Int. Conf. Multimedia and Expo, vol. 1, 2004, pp. 49–52.
[36] D. Vukadinovic and M. Pantic, "Fully automatic facial feature point detection using Gabor feature based boosted classifiers," in Proc. IEEE Int. Conf. Systems, Man, and Cybernetics, Vancouver, BC, Canada, Oct. 2005, pp. 1692–1698.
[37] M. Isard and A. Blake, "Condensation—Conditional density propagation for visual tracking," Int. J. Comput. Vis., vol. 29, no. 1, pp. 5–28, 1998.
[38] M. Isard and A. Blake, "Icondensation: Unifying low-level and high-level tracking in a stochastic framework," in Proc. Eur. Conf. Computer Vision, 1998, pp. 893–908.
[39] M. K. Pitt and N. Shephard, "Filtering via simulation: Auxiliary particle filtering," J. Amer. Stat. Assoc., vol. 94, pp. 590–599, 1999.
[40] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking," IEEE Trans. Signal Process., vol. 50, no. 2, pp. 173–188, Feb. 2002.
[41] J. MacCormick and A. Blake, "Probabilistic exclusion and partitioned sampling for multiple object tracking," Int. J. Comput. Vis., vol. 39, no. 1, pp. 57–71, 2000.
[42] P. Perez, C. Hue, J. Vermaak, and M. Gangnet, "Color-based probabilistic tracking," in Proc. Eur. Conf. Computer Vision, 2002, pp. 661–675.
[43] "Special issue on sequential state estimation: From Kalman filters to particle filters," Proc. IEEE, vol. 92, no. 3, pp. 399–574, Mar. 2004.
[44] L. D. Harmon, M. K. Khan, R. Lash, and P. F. Raming, "Machine identification of human faces," Pattern Recognit., vol. 13, pp. 97–110, 1981.
[45] H. T. Nguyen, M. Worring, and R. vd. Boomgaard, "Occlusion robust adaptive template tracking," in Proc. IEEE Int. Conf. Computer Vision, vol. 1, 2001, pp. 678–683.
[46] J. Vermaak, P. Perez, M. Gangnet, and A. Blake, "Toward improved observation models for visual tracking: Selective adaptation," in Proc. Eur. Conf. Computer Vision, 2002, pp. 645–660.
[47] Y. Wu and T. Huang, "A co-inference approach to robust tracking," in Proc. IEEE Int. Conf. Computer Vision, vol. 2, 2001, pp. 26–33.
[48] P. J. Huber, Robust Statistics. New York: Wiley, 1981.
[49] M. Schneider, A. Kandel, G. Langholz, and G. Chew, Fuzzy Expert System Tools. West Sussex, U.K.: Wiley, 1997.
[50] T. Otsuka and J. Ohya, "Recognizing multiple persons' facial expressions using HMM based on automatic extraction of significant frames from image sequences," in Proc. IEEE Int. Conf. Image Processing, vol. 2, 1997, pp. 546–549.
[51] I. Cohen, N. Sebe, A. Garg, L. S. Chen, and T. S. Huang, "Facial expression recognition from video sequences: Temporal and static modeling," Comput. Vis. Image Understand., vol. 91, pp. 160–187, 2003.
[52] P. Ekman, J. Hager, C. H. Methvin, and W. Irwin, "Ekman–Hager Facial Action Exemplars," Human Interaction Lab., Univ. of California, San Francisco.
[53] M. J. Lyons, J. Budynek, and S. Akamatsu, "Automatic classification of single facial images," IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 12, pp. 1357–1362, Dec. 1999.
[54] T. Kanade, J. Cohn, and Y. Tian, "Comprehensive database for facial expression analysis," in Proc. IEEE Int. Conf. Automatic Face and Gesture Recognition, 2000, pp. 46–53.
[55] M. Pantic, M. F. Valstar, R. Rademaker, and L. Maat, "Web-based database for facial expression analysis," in Proc. IEEE Conf. Multimedia and Expo, 2005, pp. 317–321. [Online]. Available: http://www.mmifacedb.com/
[56] A. W. Young, D. Rowland, A. J. Calder, N. L. Etcoff, A. Seth, and D. I. Perrett, "Facial expression megamix: Test of dimensional and category accounts of emotion recognition," Cognition, vol. 63, pp. 271–313, 1997.
[57] M. Pantic, N. Sebe, J. F. Cohn, and T. Huang, "Affective multimodal human-computer interaction," in Proc. ACM Int. Conf. Multimedia, 2005.

Maja Pantic (S'98–M'02) received the M.S. and Ph.D. degrees in computer science from Delft University of Technology, Delft, The Netherlands, in 1997 and 2001, respectively.
From 2001 to 2005, she was an Assistant Professor in the Department of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology. She is currently an Associate Professor in the same department, where she is doing research in the area of machine analysis of human interactive cues for achieving a natural, multimodal human–machine interaction. She has published over 40 technical papers in the areas of machine analysis of facial expressions and emotions, artificial intelligence, and human-computer interaction and has served as an invited speaker and a program committee member at several conferences in these areas.
Dr. Pantic received the Innovational Research Award of the Dutch Scientific Organization for her research on Facial Information For Advanced Interface in 2002, as one of the seven best young scientists in exact sciences in the Netherlands.

Ioannis Patras (S'97–M'02) received the B.Sc. and M.Sc. degrees in computer science from the Computer Science Department, University of Crete, Heraklion, Greece, in 1994 and 1997, respectively, and the Ph.D. degree from the Department of Electrical Engineering, Delft University of Technology, Delft, The Netherlands, in 2001.
From 2001 to 2003, he was a Postdoctoral Researcher in the area of multimedia analysis at the University of Amsterdam, Amsterdam, The Netherlands. From 2003 to 2005, he was a Postdoctoral Researcher in the area of vision-based human-machine interaction (focusing on facial and body gesture analysis) at Delft University of Technology. Since 2005, he has been a Lecturer at the Computer Vision and Pattern Recognition Group, The University of York, York, U.K. His research interests lie mainly in the areas of computer vision and pattern recognition and their applications in multimedia data management, multimodal human-machine interaction, and visual communications.