
A FACS Valid 3D Dynamic Action Unit Database with Applications to 3D Dynamic Morphable Facial Modeling

Darren Cosker
Department of Computer Science
University of Bath
[email protected]

Eva Krumhuber
School of Humanities and Social Sciences
Jacobs University
[email protected]

Adrian Hilton
Centre for Vision, Speech and Signal Processing
University of Surrey
[email protected]

Abstract

This paper presents the first dynamic 3D FACS data set for facial expression research, containing 10 subjects performing between 19 and 97 different AUs both individually and in combination. In total the corpus contains 519 AU sequences. The peak expression frame of each sequence has been manually FACS coded by certified FACS experts. This provides a ground truth for 3D FACS based AU recognition systems. In order to use this data, we describe the first framework for building dynamic 3D morphable models. This includes a novel Active Appearance Model (AAM) based 3D facial registration and mesh correspondence scheme. The approach overcomes limitations in existing methods that require facial markers or are prone to optical flow drift. We provide the first quantitative assessment of such 3D facial mesh registration techniques and show how our proposed method provides more reliable correspondence.

1. Introduction

Facial analysis using 3D models has become a popular research topic in recent years. Some of the primary benefits of such models include potentially improved robustness to pose and illumination changes during recognition [3], estimation of 3D facial shape from 2D images [2, 20], and motion capture [13]. Given this emerging popularity, a great need exists for rigorous and standardized 3D dynamic facial data sets that the computer vision community can use for experimentation.

There is a range of available data sets for 2D facial analysis, both static and dynamic, containing variation in pose, illumination, expression and disguise (e.g. see [19, 14, 17]). Expression recognition in particular is a highly active research area, with many works based on movement descriptions from the Facial Action Coding System (FACS) [12]. FACS was primarily introduced by psychologists to describe different configurations of facial actions or Action Units (AUs). FACS lists 44 AUs that form the basis of 6 prototypical facial expressions: happiness, sadness, fear, surprise, anger and disgust. Numerous attempts exist to classify these movements in both static and dynamic 2D sequences [17, 1, 19]. Perhaps the most thorough set collected to date is the Extended Cohn-Kanade Dataset (CK+) [17], which contains 593 sets of expressions with the peaks manually FACS coded to establish AU presence.

The ability to FACS code data automatically has wide potential in social psychological research on the understanding of facial expressions. One major reason for this is that manual coding is highly time consuming and often not practical for long dynamic sequences. FACS is also now often used as the movement basis for 3D facial models in movies, making automatic analysis relevant to motion capture and performance mapping [11]. However, while available data for 2D analysis is widespread, there are only a handful of 3D facial data sets available [8, 23]. Data sets portraying 3D dynamic movement are fewer still [22], do not contain AU level motions, and are not FACS coded.

There is therefore clearly a need for dynamic 3D FACS data sets comparable to the state of the art in 2D. However, given such a corpus, approaches are also required for the modeling and utilization of this data. A popular model for 3D facial analysis is the morphable model [3]. This uses a basis of static 3D laser range scans of different subjects to learn a statistical space of shape and texture deformation. However, in order to build such a model the scans must
first be non-rigidly registered to a common space. This process is required to achieve 3D mesh correspondence. While Blanz and Vetter [3] rely solely on optical flow to densely register images, Patel and Smith [20] improve accuracy by employing a set of manually labeled facial feature points.

Even though 3D morphable models are potentially powerful tools for facial analysis, previous work to date has only used static 3D scans of faces to build models. There is therefore great potential for extending the framework to incorporate dynamic data. However, the problem with building such models lies again in non-rigid registration. In the context of dynamic 3D data, this requires the creation of spatio-temporal dense feature correspondences throughout the sequences. The problem is more complex than using static scans alone since registration must reliably track highly variable nonlinear skin deformations [13].

One approach for achieving correspondence given dynamic 3D sequences is to register facial images using optical flow vectors tracked dynamically in multiple 2D stereo views [7, 24]. However, drift in the flow (caused by e.g. violation of the brightness consistency assumption) typically accumulates over time, introducing errors. Borshukov et al [6] overcome this problem by manually correcting the mesh positions when drift occurs. More recently, Bradley et al [7] mosaiced the views of 14 HD cameras to create high resolution images for skin pore tracking. By back calculating optical flow to the initial image, drift is also reduced. In addition, mesh regularization ensures that faces do not flip due to vertices overlapping. Other solutions to the registration problem include the use of facial markers and special make-up to track consistent points [18].

Existing non-rigid registration methods for dynamic facial data therefore have drawbacks: they rely on optical flow which is prone to drift over time, or use painted facial markers to acquire stable points. There is therefore clear scope for improvement. Previous work on these methods has also only been applied to animation, where errors can be hand corrected. A more quantitative assessment of their merits would therefore also be of benefit to the computer vision community. Finally, given a reliable means to non-rigidly register 3D facial data efficiently, building dynamic 3D morphable models becomes possible.

1.1. Contributions

This paper makes several contributions. It presents the first dynamic 3D FACS data set for facial expression research, portraying 10 subjects performing between 19 and 97 different AUs both individually and in combination. In total the data set contains 519 AU sequences. Compared with other state of the art 2D [17] and 3D [22] facial data sets which contain more subjects, we provide substantially more expressions per subject. As well as allowing comprehensive experimentation on per person facial movement, the data allows for thorough research in a range of tasks: large scale 3D model building, registration of 3D faces, and tracking of 3D models to 2D video.

The peak frame of each sequence has been manually FACS coded by certified FACS experts. These are individuals who have passed the FACS final test [12]. This provides the first ground truth for 3D FACS based AU recognition systems, as well as a valuable resource for building 3D dynamic morphable models for motion capture and synthesis using AU based parameters.

Secondly, our paper provides a description of the first framework for building dynamic 3D morphable facial models. This extends the state of the art in static 3D morphable model construction to incorporate dynamic data. In describing this framework, we also propose an Active Appearance Model (AAM) [9] based approach for densely and reliably registering captured 3D surface data. This method has several advantages over existing dynamic 3D facial registration methods: (1) it requires no paint or special markers on the face, and (2) it shows improved performance over optical flow based strategies which accumulate drift over time [7, 6, 24]. We compare the AAM based method to the state of the art in optical flow and mesh regularization schemes. This provides the first quantitative assessment of popular methods adopted in this area. We also include a comparison to techniques used in static 3D morphable model construction [2], and highlight limitations in directly applying these approaches given dynamic data.

2. Dynamic 3D FACS Dataset (D3DFACS)

2.1. Capture Protocol and Contents Overview

Our aim was to capture a variety of facial movements as performed by a range of posers. For this, we recruited 4 expert FACS coders and 6 FACS-untrained participants for our data set. The performers were aged 23 to 41 years (average age 29.3 years) and comprised 6 females and 4 males, all of Caucasian European origin. The expert coders, having extensive knowledge of FACS, allowed us to elicit more complex AU combinations than would be possible for FACS-unfamiliar people. Each FACS expert spent time before the session practicing the combinations as well as possible. The FACS-unfamiliar participants were provided coaching before the session on a reduced set of AUs and expressions. For a discussion on how easily people find performing different AUs, the reader is referred to [15].

In total we recorded between 80 and 97 AU sequences (including Action Descriptors (ADs) [12]) for each FACS expert performer, and between 19 and 38 sequences for each FACS non-expert. This number depended on the ability of a performer to produce the desired sequence, which either targeted a specific single AU, or a combination of AUs. We selected the combinations based on criteria for (1) the six
basic emotions outlined by Ekman et al [12], and (2) non-additive appearance changes. These latter combinations are particularly interesting since they reveal new appearance characteristics for their joint activation that cannot be traced back to the sum of single AUs (e.g. 1+4). In total 519 sequences were captured, comprising 1184 AUs. Table 1 shows the frequency of each AU in the data set.

2.2. Dynamic 3D Capture and Data Format

Each FACS performer was recorded using a 3DMD dynamic 3D stereo camera [21] (see Figure 1). The system consists of six cameras split between two pods, with 3 cameras stacked vertically on each pod. Each pod produces a 3D reconstruction of one half of the face. The top and bottom cameras of each pod are responsible for stereo reconstruction and the middle cameras are responsible for capturing UV color texture. The system samples at 60 FPS and provides (1) OBJ format 3D mesh data consisting of the two pod half-face meshes joined together, and (2) corresponding BMP format UV color texture map data for each frame.

The texture mapping provided by the system is originally a stereo one, meaning that it consists of the color camera views from the two pods joined together into one image. We modify this by converting the mapping into a cylindrical one. This means that each mesh has a UV texture map equivalent to placing a cylinder around the head and projecting the color information onto the cylinder. The mesh data consists of approximately 30K vertices per mesh, and each UV map is 1024x1280 pixels. Figure 2 shows example images of the FACS performers, including corresponding mesh and UV map data.

Figure 1. Dynamic 3D Stereo Camera used for data collection. Six cameras combine to provide 3D reconstructions of the face, with a recording rate of 60 FPS.

For each sequence the camera was set to record for between 5 and 10 seconds depending on the complexity of the AU. Performers were asked to repeat AU targets as many times as possible during this period. A mirror was set up in front of the actor so that they could monitor their own expressions before and during each capture. Recording took between 2 and 7 hours per participant. After all data recording, the sequences which most visually matched the targets from onset to peak were extracted for scoring by a FACS expert. This led to the following data set:

• 519 AU sequences (single and in combination) from 10 people, including 4 expert coders and 6 non-experts.
• Each sequence is approximately 90 frames long at 60 FPS and consists of OBJ mesh and BMP cylindrical UV texture map data.
• AU codes for each peak frame of each sequence are scored by a FACS expert.

Instructions for acquiring the database may be found at http://www.cs.bath.ac.uk/~dpc/D3DFACS/. In the remainder of the paper we describe our framework for building dynamic 3D morphable facial models. We also introduce our AAM based approach for mesh registration in dynamic sequences and compare it to: (1) existing work on facial mesh correspondence and (2) registration techniques employed in static 3D data for morphable modeling [20, 2].

3. 3D Dynamic Morphable Model Framework

In the following Section, we first provide an overview of static 3D morphable model construction before describing extensions to dynamic sequences. In static 3D morphable model construction, as proposed by Blanz and Vetter [2], a set of 200 facial scans (each of a different person) is taken from a Cyberware 2020PS laser range scanner. These are represented in a 2D space and aligned to a common coordinate frame using a dense optical flow alignment. Patel and Smith [20] improve the accuracy of the alignment by manually placing 2D landmarks on the faces, and then using a Thin Plate Spline (TPS) based warping scheme [4]. Procrustes analysis is also performed to remove head pose variation. After correspondence, the 2D UV space, which also contains a mapping to 3D shape, is sampled to generate the 3D mesh information. Both the UV texture data and the 3D mesh data are then represented using linear Principal Component Analysis (PCA) models.

We propose several extensions to this process for building dynamic 3D morphable models. Given several single dynamic sequences (e.g. an AU combination) consisting of multiple 3D meshes and corresponding UV texture maps:

Step 1: For each mesh, generate a mapping from each pixel in 2D UV space to a vertex position in 3D space. This mapping is $I(\mathbf{u}) = \mathbf{v}$, where $\mathbf{v} \in \mathbb{R}^3$ is a 3D vector coordinate, $I$ is a UV map, and $\mathbf{u}$ is a coordinate $(u, v)$. The function can be generated using a Barycentric coordinate mapping between mesh faces in 2D UV space and faces in 3D vertex space (see Section 4.1).

Step 2: Perform stand-alone non-rigid registration of each separate UV texture map sequence. This process identifies and tracks image features through neighboring image
AU  Description          Total | AU  Description    Total | AU  Description        Total
1   Inner Brow Raiser       45 | 17  Chin Raiser      118 | 31  Jaw Clencher           4
2   Outer Brow Raiser       36 | 18  Lip Pucker        26 | 32  Lip Bite               5
4   Brow Lowerer            56 | 19  Tongue Out         3 | 33  Cheek Blow             4
5   Upper Lid Raiser        42 | 20  Lip Stretcher     30 | 34  Cheek Puff             3
6   Cheek Raiser            16 | 21  Neck Tightener     6 | 35  Cheek Suck             3
7   Lid Tightener           38 | 22  Lip Funneler      15 | 36  Tongue Bulge           4
9   Nose Wrinkler           36 | 23  Lip Tightener     47 | 37  Lip Wipe               3
10  Upper Lip Raiser        97 | 24  Lip Pressor       22 | 38  Nostril Dilator       29
11  Nasolabial Deepener     16 | 25  Lips Part        164 | 39  Nostril Compressor     9
12  Lip Corner Puller       77 | 26  Jaw Drop          63 | 43  Eyes Closed           13
13  Cheek Puffer             5 | 27  Mouth Stretch     14 | 61  Eyes Turn Left         4
14  Dimpler                 32 | 28  Lip Suck           8 | 62  Eyes Turn Right        4
15  Lip Corner Depressor    28 | 29  Jaw Thrust         4 | 63  Eyes Turn Up           4
16  Lower Lip Depressor     42 | 30  Jaw Sideways       5 | 64  Eyes Turn Down         4

Table 1. AU frequencies identified by manual FACS coders in the D3DFACS data set (based on FACS descriptions in Ekman et al [12]).

Figure 2. Examples from the D3DFACS data set. The top two rows show camera views from 6 participants. The bottom two rows show
3D mesh data (textured and un-textured), and corresponding UV texture maps.

sequence frames. In this paper we propose a dense AAM based approach to achieve registration and compare to state of the art approaches. Directly applying optical flow for registration without a mesh regularization term (as in [2]) produces drift artifacts in the meshes and images. Even with a regularization term (as in [7, 24]) tracking accuracy still depends on optical flow quality which can be error prone (see Section 4.2). Registration is with respect to a neutral expression image selected from the sequence.

Step 3: Perform global non-rigid registration of the dynamic sequences. One of the neutral sequence poses is chosen as a global template to which each of the UV sequences is then registered using a single dense warping per sequence. This registered UV space provides data for the linear texture PCA model (see Section 4.2).

Step 4: Regularly sample the UV space to calculate 3D vertices for each corresponding mesh. The more accurate the pixel based registration is, the more accurate the mesh correspondence (see Section 4.2).

Step 5: Perform rigid registration of the 3D mesh data. Since sequences at this point have 3D mesh correspondence, Procrustes analysis [5] may be applied to align the meshes in an efficient manner. This removes head pose variation in the dynamic sequences.

Step 6: Build linear PCA models for shape and texture using the registered 3D mesh and UV texture data.

We now expand on the above process concentrating primarily on the procedures for non-rigid registration.
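
As an illustration only, Steps 5 and 6 can be sketched in a few lines of NumPy. The sketch assumes that every mesh is already in vertex correspondence (Step 4). It aligns each mesh to a single reference using a rotation and a translation, which is a simplification of full Procrustes analysis [5] (that may also normalize scale and iterate on the mean shape), and it builds a linear PCA model through an SVD. The function names `rigid_align`, `build_pca`, `project` and `reconstruct` are hypothetical and are not part of the D3DFACS release.

```python
import numpy as np


def rigid_align(mesh, reference):
    """Rigidly align `mesh` (m x 3) to `reference` (m x 3).

    Rows must correspond vertex-for-vertex. Only rotation and translation
    are estimated (assumption: no scale normalization, single reference).
    """
    mu_mesh = mesh.mean(axis=0)
    mu_ref = reference.mean(axis=0)
    A = mesh - mu_mesh
    B = reference - mu_ref
    # Orthogonal Procrustes / Kabsch: rotation R minimising ||A @ R - B||_F.
    U, _, Vt = np.linalg.svd(A.T @ B)
    # Guard against reflections so that det(R) = +1.
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    return A @ R + mu_ref


def build_pca(data, n_components):
    """Linear PCA model from `data` (n x d), one flattened mesh per row.

    Returns the mean (d,) and the leading eigenvectors as columns (d x k),
    ordered by the variance they explain.
    """
    mean = data.mean(axis=0)
    _, _, Vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, Vt[:n_components].T


def project(x, mean, P):
    """Low-dimensional weights b = P^T (x - mean)."""
    return P.T @ (x - mean)


def reconstruct(b, mean, P):
    """Re-synthesis x' = mean + P b."""
    return mean + P @ b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    neutral = rng.normal(size=(100, 3))          # stand-in for a neutral mesh
    # Toy sequence: small deformations of the neutral mesh with a head shift.
    sequence = [neutral + 0.05 * rng.normal(size=neutral.shape) + 0.3 * k
                for k in range(10)]
    aligned = np.stack([rigid_align(m, neutral).ravel() for m in sequence])
    mean, P = build_pca(aligned, n_components=3)
    b = project(aligned[0], mean, P)
    print("weights for frame 0:", b.shape, "mesh dims:", aligned[0].shape)
```

Each frame of a sequence is then stored as a short weight vector rather than a full mesh, which is the representation the later sections rely on.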
4. 3D Registration and Correspondence

4.1. Creating a 2D to 3D Mapping (Step 1)

A sequence of data consists of a set of meshes $\mathbf{X} = [X_1, \dots, X_n]$, where each mesh is $X = [\mathbf{x}_1^T, \mathbf{x}_2^T, \dots, \mathbf{x}_m^T]^T$ and $\mathbf{x}_i = [x_{ix}, x_{iy}, x_{iz}]^T \in \mathbb{R}^3$. There also exists a set of UV texture maps $\mathbf{I} = [I_1, \dots, I_n]$, and a set of UV coordinates $\mathbf{U} = [U_1, U_2, \dots, U_n]$, where each $U = [\mathbf{u}_1^T, \mathbf{u}_2^T, \dots, \mathbf{u}_m^T]^T$ and $\mathbf{u}_i = [u_i, v_i] \in \mathbb{R}^2$. The UV texture maps supply color data to the mesh in the form of images, with the UV coordinates linking individual vertices $\mathbf{x}_i$ to unique points $\mathbf{u}_i$ on these images. In the above definitions, $n$ is the number of meshes and corresponding UV maps in a sequence. Similarly, $m$ is the number of vertices in a mesh and the number of corresponding UV coordinates.

There also exists a set of common triangular faces per mesh $F_i$, $i = 1, \dots, n$, where faces in the 3D vertex space correspond to the same faces in the 2D texture space. The entire set of faces for a sequence may also be defined as $\mathbf{F} = [F_1, \dots, F_n]$.

We approach 3D correspondence as a 2D image registration problem. From a theoretical point of view, perfect one-to-one pixel registration between successive face images relates to perfect 3D mesh correspondence. The goal is to achieve as near an optimal correspondence as possible.

It is therefore useful from an implementation point of view to work primarily in 2D space. We first generate 3D images $I_{3D}(\mathbf{u}) = \mathbf{x}$. This is achieved by taking each face in turn, and for each pixel within its triangle in 2D space calculating the corresponding 3D position using a Barycentric coordinate mapping. Repeating for each triangle results in a dense 2D pixel to 3D vertex mapping for the entire UV map. Operations performed on $I$ are from now on also applied to $I_{3D}$, including optical flow and TPS warping.

4.2. Non-Rigid Alignment (Steps 2 and 3)

We now describe several strategies for non-rigid alignment, including our proposed method. In Section 5 we then provide experimental results comparing these.

Optical Flow: Blanz and Vetter [2] calculate smoothed optical flow to find corresponding features between images of 200 different people. However, the formulation and choice of features is tuned to the particular data. In this work we consider a more standardized approach and extend it to dynamic sequences. We calculate concatenated Lucas-Kanade (LK) [16] flow fields that warp images between $I_{+i}$ and $I_0$, where $I_0$ is the neutral expression image (UV map). Flow is summed for the images between $I_{+i}$ and $I_0$, providing the concatenated flow. Smoothing of the flow field is applied in the form of local averaging in both the spatial and temporal domains. Flow fields calculated from the UV maps are also then applied to the $I_{3D}$ images.

Optical Flow and Regularization: Bradley et al [7], Zhang et al [24] and Borshukov et al [6] use optical flow from stereo image pairs to update a mesh through a sequence. They use this technique for animation applications. The mesh is initialized in frame 1, and its vertices moved to optimal positions in successive frames using flow vectors merged from each stereo view. The update is also combined with a mesh regularization constraint to avoid flipped faces. We extend this approach by using a single UV space for optical flow calculation and mesh updating as opposed to merging stereo flow fields. For regularization, we sparsely sample the flow field and interpolate the positions of in-between points using TPS warping (see AAM and TPS next). This ensures that flow vectors follow the behavior of the sparse control points, but as with previous approaches does not guarantee against tracking errors accumulating due to optical flow drift.

AAM and TPS: Patel and Smith [20] achieve correspondence in 3D morphable model construction by manually landmarking 3D images and aligning them using TPS based warping. TPS is a type of Radial Basis Function (RBF). RBFs can be used to define a mapping between any point defined with respect to a set of basis control points (e.g. landmarks in one image), and its new position given a change in the control points (e.g. landmarks in the target image). Thus, a dense mapping between pixels in two images may be defined. TPS itself provides a kernel that models this mapping based on a physical bending energy term (for more detail see [4]).

This approach has several advantages over optical flow based correspondence: (1) the TPS warp provides a smooth and dense warping field with no drift artifacts in the mesh or images, and (2) manual point placement guarantees correspondence of key facial features whereas optical flow is prone to drift. We extend this to dynamic sequences by building an AAM and using it to automatically track feature points through a dynamic sequence of multiple UV maps. Pixels (or $(u, v)$ coordinates) within the control points are corresponded with those in neighboring frames using the TPS mapping, which warps each $I$ and $I_{3D}$ to a common coordinate frame (the neutral expression).

AAMs are well known in the computer vision literature, and a thorough overview may be found in [9]. We use the same principles here and define our AAM as:

$$ l = \bar{l} + P_l W Q_l c, \qquad g = \bar{g} + P_g Q_g c \tag{1} $$

where $l$ is a vector of image landmarks, $\bar{l}$ are the mean landmarks learned from training, $g$ is a vector of image pixels inside the region defined by $l$, and $\bar{g}$ are the mean image pixels learned from training. The eigenvectors of the training sets of vectors $l$ and $g$ are the matrices $P_l$ and $P_g$ respectively. The matrix $W$ is a set of scaling weights, the matrix $Q$ represents the eigenvectors of the joint distribution of landmark and image data, and $c$ is the appearance parameter.
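
The TPS mapping that turns sparse landmarks (such as the AAM landmarks $l$) into a dense pixel correspondence can be sketched as follows. This is a minimal NumPy illustration of a standard thin plate spline interpolant with the kernel $U(r) = r^2 \log r$ and an affine term, following the general formulation in [4]; it is not the implementation used for the experiments, and the function names `tps_kernel`, `fit_tps` and `apply_tps` are hypothetical.

```python
import numpy as np


def tps_kernel(r):
    """Thin plate spline radial kernel U(r) = r^2 log r, with U(0) = 0."""
    out = np.zeros_like(r, dtype=float)
    nz = r > 0
    out[nz] = r[nz] ** 2 * np.log(r[nz])
    return out


def fit_tps(src, dst):
    """Fit a 2D TPS that maps control points `src` (n x 2) onto `dst` (n x 2).

    Returns the kernel weights (n x 2) and the affine part (3 x 2).
    Assumes the control points are distinct and not all collinear.
    """
    n = src.shape[0]
    dists = np.linalg.norm(src[:, None, :] - src[None, :, :], axis=2)
    K = tps_kernel(dists)
    P = np.hstack([np.ones((n, 1)), src])
    # Block system [[K, P], [P^T, 0]] [w; a] = [dst; 0].
    L = np.zeros((n + 3, n + 3))
    L[:n, :n] = K
    L[:n, n:] = P
    L[n:, :n] = P.T
    rhs = np.zeros((n + 3, 2))
    rhs[:n] = dst
    params = np.linalg.solve(L, rhs)
    return params[:n], params[n:]


def apply_tps(points, src, weights, affine):
    """Map arbitrary 2D `points` (k x 2) with a fitted TPS."""
    dists = np.linalg.norm(points[:, None, :] - src[None, :, :], axis=2)
    P = np.hstack([np.ones((points.shape[0], 1)), points])
    return tps_kernel(dists) @ weights + P @ affine


if __name__ == "__main__":
    # Toy landmarks in a tracked frame and in the neutral frame.
    frame_pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                          [1.0, 1.0], [0.5, 0.5]])
    neutral_pts = frame_pts + np.array([[0.0, 0.0], [0.1, 0.0], [0.0, -0.1],
                                        [0.05, 0.05], [0.0, 0.2]])
    w, a = fit_tps(frame_pts, neutral_pts)
    # Dense grid of (u, v) coordinates pushed into the neutral frame.
    grid = np.stack(np.meshgrid(np.linspace(0, 1, 4),
                                np.linspace(0, 1, 4)), axis=-1).reshape(-1, 2)
    print(apply_tps(grid, frame_pts, w, a).shape)
```

The interpolant passes exactly through the control points, so the tracked landmarks stay in correspondence while in-between pixels follow the smooth warp. In practice the warped coordinates would be used to resample both $I$ and $I_{3D}$.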
Fitting AAMs to new images is a well covered topic in the computer vision literature (see [9, 17]). In this work we define a simple minimization approach which compares the current guess to the model's best reconstruction of this:

$$ E = \min_{c} \left( g_l - \left( P^T (g_l - \bar{g}) \right) \right) \tag{2} $$

where $g_l$ is the portion of the image $I$ within the area defined by $l$ (the current guess). Calculating $g_l$ requires first calculating $l$ using $c$ (in (1)), and then warping this region into the space defined by the mean landmarks $\bar{l}$. In order to optimize $E$ we use the Levenberg-Marquardt algorithm. The process of tracking results in a set of labeled feature based landmarks per frame (neutral expression). These can then be used to warp each image to the common coordinate frame, thus achieving dense non-rigid correspondence.

4.3. Sampling, Rigid Alignment and Statistical Modeling (Steps 4, 5 and 6)

Given a set of non-rigidly aligned sequences, these are aligned again to a single common coordinate frame. This is selected to be a neutral expression from the full training sequence. The space of aligned images $I_{3D}$ is then uniformly sampled. This sampling defines the topology and density of the facial mesh, recalling that $I_{3D}(\mathbf{u}) = \mathbf{x}$. Since each $I_{3D}$ refers to a different set of 3D points, aligning these and then sampling in a uniform manner results in a unique set of registered 3D meshes. Similarly, there now also exists a common set of faces $F$ for each mesh.

The entire set of 3D mesh data can now be rigidly aligned using Procrustes analysis (see [5] for a detailed description). Following [2] the registered 3D mesh $X$ and UV texture data $I$ may now be expressed using two PCA models:

$$ X' = \bar{X} + P_X b_X, \qquad I' = \bar{I} + P_I b_I \tag{3} $$

where $\bar{X}$ is the mean mesh, $\bar{I}$ is the mean UV image texture, $P_X$ and $P_I$ are the eigenvectors of $X$ and $I$, and $b_X$ and $b_I$ are vectors of weights. The eigenvectors of $X$ and $I$ are ordered by the proportion of total variance in the data they represent. Removing some of their columns therefore means that the projected weights $b_X$ and $b_I$ can be made much smaller than $X$ and $I$. Rewriting (3) allows us to perform this parameterization to a lower dimensional space:

$$ b_X = P_X^T (X' - \bar{X}), \qquad b_I = P_I^T (I' - \bar{I}) \tag{4} $$

This provides a convenient lower dimensional representation for storing dynamic facial movements and performing optimization when fitting the model to new sequences.

5. Experiments

In this Section we perform baseline experiments comparing the three registration approaches described in Section 4.2. These are (1) standard optical flow concatenation which extends [2], (2) a combined optical flow and regularization approach similar to [7, 24], and (3) the new AAM-TPS combination approach proposed in this paper.

For test purposes we selected 8 dynamic AU sequences from our data set consisting of approximately 65 frames each. For optical flow we use the pyramidal Lucas-Kanade (LK) algorithm as in [7]. We first wished to compare how well the AAM and LK algorithms tracked facial feature points versus a ground truth. To create the ground truth we manually annotated each frame from each sequence with landmark points at 47 key facial features around the eyes, nose and mouth. This test would give an indication of how stable points are over time, and whether drift occurs as reported in previous work. For the AAM test, an individual model with 47 landmarks was trained for each sequence using 3 manually selected frames, typically at the beginning, middle and end. Points were manually initialized in frame 1 for both the AAM and LK tests. Table 2 shows the mean Euclidean error between ground truth points and tracked points (in pixels) for each frame. It can be seen that the AAM error is consistently lower than the LK error. Figure 3 shows examples of how the LK error accumulates over the course of tracking, supporting the optical flow drift observations in [6, 7, 24]. This is evidence that the AAM method provides a more stable tracking approach over time, and is a valuable tool for reducing drift.

We next wished to evaluate how well each method performed registration of the image sequences from a qualitative point of view. Figure 4 shows example registrations of peak frames to neutral frames for four sequences using (1) dense concatenated LK flow fields between the peak and neutral frame (see Section 4.2 - Optical Flow), (2) concatenated LK optical flow combined with TPS regularization (see Section 4.2 - Optical Flow and Regularization), and (3) feature points tracked with an AAM and registered using TPS (see Section 4.2 - AAM and TPS).

It can be seen from Figure 4 that the LK method used alone produces noticeable drift artifacts. We observed that this is due to pixels overlapping each other, and is a result of the flow field being concatenated over consecutive neighboring frames. One approach to avoid this in the future may be to add a temporal constraint to the flow calculation which observes learned facial deformations. The LK+TPS method overcomes the drawback of pixel overlap due to (1) tracked points being initially far apart, and (2) the TPS method regularizing the positions of pixels in between the tracked points. Alignment is much improved over LK alone. However, as highlighted by the red dotted circles, accumulated optical flow drift causes some facial features (such as the lower lip and cheeks) to distort. The AAM-TPS method provides the most stable registration, as demonstrated qualitatively by an absence of drift artifacts and pixel overlaps. We have also used this technique in a
perceptual face experiment [10] and participants reported no visible issues with the model.

Finally, we used the AAM-TPS approach to create a morphable model (see Section 4.3). We parameterized the original sequences using (4) and then re-synthesized them using (3). Figure 5 shows example outputs from the model. In order to show how the mesh deforms smoothly with the tracked facial features we also show corresponding examples using a UV map of a checkered pattern. The deformations in the pattern clearly demonstrate that the mesh is following the correct facial movement.

[Figure 3 plots: tracking error (vertical axis, "Error") against frame number (horizontal axis, "Frame No") for the sequences 1+2+4+5+20+25, 4+7+17+23, 9+10+25 and 12+10.]

Figure 3. AAM-TPS (blue line) and Lucas-Kanade (red line) tracking errors for 4 sequences. It can be seen that the optical flow method accumulates error as the sequence moves on, whereas the AAM value remains consistently lower.

AU Sequence       AAM-TPS     LK
1+2+4+5+20+25        18.6   43.1
20+23+25             53.8   63.8
9+10+25              25.4   41.7
18+25                15.3   43.6
16+10+25             23.2   38.8
12+10                15.2   48.7
4+7+17+23             3.4   28.5
1+4+15                2.3   26.2

Table 2. AAM-TPS and Lucas-Kanade mean Euclidean error values (in pixels) for tracked feature points versus ground truth landmark points. 8 dynamic AU sequences were tracked in this particular test. The result demonstrates the improved reliability of the AAM tracking method over the optical flow approach.

Figure 4. Peak frames registered to neutral frames using LK, LK+TPS and AAM+TPS (see Section 4.2). In each case the concatenated sequence information between the peak and neutral frame is used for registration. Red circles highlight drift errors in the LK+TPS approach.

Figure 5. Outputs from a morphable model constructed using the AAM+TPS method: (left to right) Neutral, 9+10+25, 20+23+25, 12+10 and 16+10+25. The checker pattern highlights the underlying mesh deformation.

6. Conclusion and Future Work

In this paper we have presented the first dynamic 3D FACS data set (D3DFACS) for facial expression research. The corpus is fully FACS coded and contains 10 participants performing a total of 519 AU sequences. We also proposed a framework for building dynamic 3D morphable facial models and described an AAM based approach for non-rigid 3D mesh registration. Our experiments show that the approach has several advantages over optical flow based registration. For future work we wish to perform experiments comparing the performance of dynamic morphable models versus static ones in a series of benchmark tests such as tracking. We would also like to combine model based approaches such as AAMs with optical flow to improve dense feature point registration between the tracked feature points.

Acknowledgements

We would like to thank the Royal Academy of Engineering/EPSRC for funding this work. Also thanks to all the participants in the data set, particularly the FACS experts: Gwenda Simons, Kornelia Gentsch and Michaela Rohr.

References

[1] M. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan. Automatic recognition and facial actions in spontaneous behavior. Journal of Multimedia, 2006.
[2] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Proc. of ACM Siggraph, 1999.
[3] V. Blanz and T. Vetter. Face recognition based on fitting a 3D morphable model. IEEE Trans. Pattern Anal. Mach. Intell., 25:1063–1074, 2003.
[4] F. Bookstein. Principal warps: Thin-plate splines and the decomposition of deformations. IEEE Trans. Pattern Anal. Mach. Intell., 11:567–585, 1989.
[5] F. Bookstein. Morphometric Tools for Landmark Data: Geometry and Biology. Cambridge Uni. Press, 1991.
[6] G. Borshukov, D. Piponi, O. Larsen, J. Lewis, and C. Tempelaar-Lietz. Universal capture - image based facial animation for The Matrix Reloaded. In ACM SIGGRAPH Sketch, 2003.
[7] D. Bradley, W. Heidrich, T. Popa, and A. Sheffer. High resolution passive facial performance capture. ACM Trans. Graph., 29:1–10, 2010.
[8] F. R. G. Challenge. http://www.nist.gov/itl/iad/ig/frgc.cfm.
[9] T. Cootes, G. Edwards, and C. Taylor. Active appearance models. IEEE Trans. Pattern Anal. Mach. Intell., 23:681–685, 2001.
[10] D. Cosker, E. Krumhuber, and A. Hilton. Perception of linear and nonlinear motion properties using a FACS validated 3D facial model. In Proc. of ACM Applied Perception in Graphics and Visualisation, pages 101–108, 2010.
[11] J. Duncan. The unusual birth of Benjamin Button. Cinefex, 2009.
[12] P. Ekman, W. Friesen, and J. Hager. Facial Action Coding System: Second Edition. Salt Lake City: Research Nexus eBook, 2002.
[13] Y. Furukawa and J. Ponce. Dense 3D motion capture for human faces. In Proc. of IEEE Computer Vision and Pattern Recognition (CVPR), pages 1674–1681, 2009.
[14] A. Georghiades, P. Belhumeur, and D. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intell., 23(6):643–660, 2001.
[15] P. Gosselin, M. Perron, and M. Beaupre. The voluntary control of facial action units in adults. Emotion, 10(2):266–271, 2010.
[16] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proc. of Image Understanding Workshop, pages 121–130, 1981.
[17] P. Lucey, J. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Proc. of IEEE Computer Vision and Pattern Recognition (CVPR), pages 94–101, 2010.
[18] W. Ma, A. Jones, J. Chiang, T. Hawkins, S. Frederiksen, P. Peers, M. Vukovic, M. Ouhyoung, and P. Debevec. Facial performance synthesis using deformation-driven polynomial displacement maps. ACM Trans. Graph., 27(5):1–10, 2008.
[19] M. Pantic, M. Valstar, R. Rademaker, and L. Matt. Fully automatic facial recognition in spontaneous behavior. In Proc. of International Conference on Multimedia and Expo, pages 317–321, 2005.
[20] A. Patel and W. Smith. 3D morphable face models revisited. In Proc. of IEEE Computer Vision and Pattern Recognition (CVPR), pages 1327–1334, 2009.
[21] D. Systems. http://www.3dmd.com.
[22] L. Yin, X. Chen, Y. Sun, T. Worm, and M. Reale. A high-resolution 3D dynamic facial expression database. In Proc. of Int. Conf. on Auto. Face and Gesture Recog., 2008.
[23] L. Yin, X. Wei, Y. Sun, J. Wang, and M. Rosato. A 3D facial expression database for facial behavior research. In Proc. of Int. Conf. on Auto. Face and Gesture Recog., 2006.
[24] L. Zhang, N. Snavely, B. Curless, and S. Seitz. Spacetime faces: high resolution capture for modeling and animation. ACM Trans. Graph., 23(3):548–558, 2004.
