FACS 3DMM
Adrian Hilton
Centre for Vision, Speech and Signal Processing
University of Surrey
[email protected]
Table 1. AU frequencies identified by manual FACS coders in the D3DFACS data set (based on FACS descriptions in Ekman et al [12]).
Figure 2. Examples from the D3DFACS data set. The top two rows show camera views from 6 participants. The bottom two rows show
3D mesh data (textured and un-textured), and corresponding UV texture maps.
sequence frames. In this paper we propose a dense AAM based approach to achieve registration and compare it to state of the art approaches. Directly applying optical flow for registration without a mesh regularization term (as in [2]) produces drift artifacts in the meshes and images. Even with a regularization term (as in [7, 24]), tracking accuracy still depends on optical flow quality, which can be error prone (see Section 4.2). Registration is with respect to a neutral expression image selected from the sequence.

Step 3: Perform global non-rigid registration of the dynamic sequences. One of the neutral sequence poses is chosen as a global template to which each of the UV sequences is then registered using a single dense warping per sequence. This registered UV space provides data for the linear texture PCA model (see Section 4.2).

Step 4: Regularly sample the UV space to calculate 3D vertices for each corresponding mesh. The more accurate the pixel based registration, the more accurate the mesh correspondence (see Section 4.2).

Step 5: Perform rigid registration of the 3D mesh data. Since the sequences at this point have 3D mesh correspondence, Procrustes analysis [5] may be applied to align the meshes in an efficient manner. This removes head pose variation in the dynamic sequences.

Step 6: Build linear PCA models for shape and texture using the registered 3D mesh and UV texture data.

We now expand on the above process, concentrating primarily on the procedures for non-rigid registration.
4. 3D Registration and Correspondence

4.1. Creating a 2D to 3D Mapping (Step 1)

A sequence of data consists of a set of meshes $\mathbf{X} = [X_1, \ldots, X_n]$, where $X = [\mathbf{x}_1^T, \mathbf{x}_2^T, \ldots, \mathbf{x}_m^T]^T$ and $\mathbf{x}_i = [x_{ix}, x_{iy}, x_{iz}]^T \in \mathbb{R}^3$. There also exists a set of UV texture maps $\mathbf{I} = [I_1, \ldots, I_n]$ and a set of UV coordinates $\mathbf{U} = [U_1, U_2, \ldots, U_n]$, where $U = [\mathbf{u}_1^T, \mathbf{u}_2^T, \ldots, \mathbf{u}_m^T]^T$ and $\mathbf{u}_i = [u_i, v_i] \in \mathbb{R}^2$. The UV texture maps supply color data to the mesh in the form of images, with the UV coordinates linking individual vertices $\mathbf{x}_i$ to unique points $\mathbf{u}_i$ on these images. In the above definitions, $n$ is the number of meshes and corresponding UV maps in a sequence. Similarly, $m$ is the number of vertices in a mesh and the number of corresponding UV coordinates.

There also exists a set of common triangular faces per mesh $F_i$, $i = 1 \ldots n$, where faces in the 3D vertex space correspond to the same faces in the 2D texture space. The entire set of faces for a sequence may also be defined as $\mathbf{F} = [F_1, \ldots, F_n]$.

We approach 3D correspondence as a 2D image registration problem. From a theoretical point of view, perfect one-to-one pixel registration between successive face images relates to perfect 3D mesh correspondence. The goal is to achieve as near an optimal correspondence as possible.

It is therefore useful from an implementation point of view to work primarily in 2D space. We first generate 3D images $I_{3D}(\mathbf{u}) = \mathbf{x}$. This is achieved by taking each face in turn and, for each pixel within its triangle in 2D space, calculating the corresponding 3D position using a barycentric coordinate mapping. Repeating this for each triangle results in a dense 2D pixel to 3D vertex mapping for the entire UV map. Operations performed on $I$ are from now on also applied to $I_{3D}$, including optical flow and TPS warping.
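To make Step 1 concrete, the sketch below rasterizes each UV triangle and stores the barycentrically interpolated 3D position at every pixel it covers, yielding the dense 2D-to-3D map $I_{3D}$. It is a minimal NumPy illustration under our own assumptions (function name, per-pixel loop, square map resolution, NaN for uncovered pixels), not the authors' implementation:

```python
import numpy as np

def rasterize_uv_to_3d(vertices, uv, faces, size):
    """Build the dense 2D-to-3D map I3D: a (size x size x 3) image whose
    pixel (u, v) stores the 3D surface position at that UV location.
    vertices: (m, 3) positions x_i; uv: (m, 2) coordinates u_i in [0, 1];
    faces: (k, 3) vertex indices per triangle."""
    i3d = np.full((size, size, 3), np.nan)    # NaN marks uncovered pixels
    for tri in faces:
        p = uv[tri] * (size - 1)              # triangle corners in pixels
        x3d = vertices[tri]                   # matching 3D corner positions
        lo = np.maximum(np.floor(p.min(axis=0)).astype(int), 0)
        hi = np.minimum(np.ceil(p.max(axis=0)).astype(int), size - 1)
        t = np.array([p[0] - p[2], p[1] - p[2]]).T   # pixel -> barycentric
        if abs(np.linalg.det(t)) < 1e-12:
            continue                          # skip degenerate triangles
        t_inv = np.linalg.inv(t)
        for py in range(lo[1], hi[1] + 1):
            for px in range(lo[0], hi[0] + 1):
                lam = t_inv @ (np.array([px, py]) - p[2])
                bary = np.array([lam[0], lam[1], 1.0 - lam.sum()])
                if (bary >= -1e-9).all():     # pixel lies inside the triangle
                    i3d[py, px] = bary @ x3d  # barycentric blend of corners
    return i3d
```

A production version would vectorize the inner loops, but the per-pixel form mirrors the description above.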
4.2. Non-Rigid Alignment (Steps 2 and 3)

We now describe several strategies for non-rigid alignment, including our proposed method. In Section 5 we then provide experimental results comparing these.

Optical Flow: Blanz and Vetter [2] calculate smoothed optical flow to find corresponding features between images of 200 different people. However, the formulation and choice of features is tuned to the particular data. In this work we consider a more standardized approach and extend it to dynamic sequences. We calculate concatenated Lucas-Kanade (LK) [16] flow fields that warp images between $I_i$ and $I_0$, where $I_0$ is the neutral expression image (UV map). Flow is summed over the images between $I_i$ and $I_0$, providing the concatenated flow. Smoothing of the flow field is applied in the form of local averaging in both the spatial and temporal domains. Flow fields calculated from the UV maps are then also applied to the $I_{3D}$ images.
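As an illustration of flow concatenation, the following sketch tracks every neutral-frame pixel forward through the sequence by composing frame-to-frame dense flow fields. OpenCV's Farneback flow stands in for the pyramidal LK flow used in the paper, and the spatial/temporal smoothing described above is omitted; the function name and parameter values are ours:

```python
import cv2
import numpy as np

def concatenated_flow_maps(frames):
    """Track every neutral-frame pixel forward through the sequence by
    composing frame-to-frame dense flow. frames: grayscale uint8 UV maps,
    frames[0] being the neutral expression I_0. Returns, per frame i, a
    map giving the position in I_i of each pixel of I_0."""
    h, w = frames[0].shape
    gx, gy = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    pos = np.dstack([gx, gy])                 # identity map for I_0
    maps = [pos.copy()]
    for prev, curr in zip(frames, frames[1:]):
        # Dense flow from I_{i-1} to I_i (Farneback for brevity; the paper
        # uses a pyramidal Lucas-Kanade flow with extra smoothing).
        f = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                         0.5, 3, 15, 3, 5, 1.2, 0)
        fx_map, fy_map = cv2.split(f)         # contiguous single channels
        mx = np.ascontiguousarray(pos[..., 0])
        my = np.ascontiguousarray(pos[..., 1])
        # Advance each tracked pixel by the flow sampled where it now sits.
        fx = cv2.remap(fx_map, mx, my, cv2.INTER_LINEAR)
        fy = cv2.remap(fy_map, mx, my, cv2.INTER_LINEAR)
        pos = pos + np.dstack([fx, fy])
        maps.append(pos.copy())
    return maps

# Warping frame i (or its I3D counterpart) into I_0's coordinate frame:
# m = maps[i]; reg = cv2.remap(frames[i], m[..., 0], m[..., 1], cv2.INTER_LINEAR)
```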
Optical Flow and Regularization: Bradley et al [7], Zhang et al [24] and Borshukov et al [6] use optical flow from stereo image pairs to update a mesh through a sequence. They use this technique for animation applications. The mesh is initialized in frame 1, and its vertices are moved to optimal positions in successive frames using flow vectors merged from each stereo view. The update is also combined with a mesh regularization constraint to avoid flipped faces. We extend this approach by using a single UV space for optical flow calculation and mesh updating, as opposed to merging stereo flow fields. For regularization, we sparsely sample the flow field and interpolate the positions of in-between points using TPS warping (see AAM and TPS below). This ensures that flow vectors follow the behavior of the sparse control points, but as with previous approaches it does not guarantee against tracking errors accumulating due to optical flow drift.

AAM and TPS: Patel and Smith [20] achieve correspondence in 3D morphable model construction by manually landmarking 3D images and aligning them using TPS based warping. TPS is a type of Radial Basis Function (RBF). RBFs can be used to define a mapping between any point defined with respect to a set of basis control points (e.g. landmarks in one image) and its new position given a change in the control points (e.g. landmarks in the target image). Thus, a dense mapping between pixels in two images may be defined. TPS itself provides a kernel that models this mapping based on a physical bending energy term (for more detail see [4]).

This approach has several advantages over optical flow based correspondence: (1) the TPS warp provides a smooth and dense warping field with no drift artifacts in the mesh or images, and (2) manual point placement guarantees correspondence of key facial features, whereas optical flow is prone to drift. We extend this to dynamic sequences by building an AAM and using it to automatically track feature points through a dynamic sequence of multiple UV maps. Pixels (or $(u, v)$ coordinates) within the control points are corresponded with those in neighboring frames using the TPS mapping, which warps each $I$ and $I_{3D}$ to a common coordinate frame (the neutral expression).
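A minimal sketch of the TPS step, assuming SciPy's thin-plate spline interpolator (scipy.interpolate.RBFInterpolator) in place of a hand-rolled kernel; the function name and the nearest-neighbour sampling are simplifications rather than the authors' code. Applied to each $I$ and $I_{3D}$, with tracked landmarks as src_pts and the neutral frame's landmarks as dst_pts, it produces the common coordinate frame described above:

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def tps_warp_to_neutral(image, src_pts, dst_pts):
    """Warp `image` so its landmarks src_pts land on dst_pts (the neutral
    frame's landmarks), interpolating every in-between pixel with a
    thin-plate spline. Point arrays are (k, 2) in (x, y) order."""
    h, w = image.shape[:2]
    # Fit the TPS from neutral landmark positions to this frame's
    # positions; pulling through it pushes the frame onto the neutral grid.
    tps = RBFInterpolator(dst_pts, src_pts, kernel='thin_plate_spline')
    gx, gy = np.meshgrid(np.arange(w), np.arange(h))
    grid = np.column_stack([gx.ravel(), gy.ravel()]).astype(float)
    sample = tps(grid).reshape(h, w, 2)       # where to read the input
    xs = np.clip(np.round(sample[..., 0]).astype(int), 0, w - 1)
    ys = np.clip(np.round(sample[..., 1]).astype(int), 0, h - 1)
    return image[ys, xs]                      # nearest-neighbour pull
```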
AAMs are well known in the computer vision literature, and a thorough overview may be found in [9]. We use the same principles here and define our AAM as:

$$\mathbf{l} = \bar{\mathbf{l}} + P_l W Q_l \mathbf{c}, \qquad \mathbf{g} = \bar{\mathbf{g}} + P_g Q_g \mathbf{c} \qquad (1)$$

where $\mathbf{l}$ is a vector of image landmarks, $\bar{\mathbf{l}}$ are the mean landmarks learned from training, $\mathbf{g}$ is a vector of image pixels inside the region defined by $\mathbf{l}$, and $\bar{\mathbf{g}}$ are the mean image pixels learned from training. The eigenvectors of the training sets of vectors $\mathbf{l}$ and $\mathbf{g}$ are the matrices $P_l$ and $P_g$ respectively. The matrix $W$ is a set of scaling weights, the matrix $Q$ represents the eigenvectors of the joint distribution of landmark and image data, and $\mathbf{c}$ is the appearance parameter.
Fitting AAMs to new images is a well covered topic in the computer vision literature (see [9, 17]). In this work we define a simple minimization approach which compares the current guess to the model's best reconstruction of it:

$$E = \min_{\mathbf{c}} \left\| \mathbf{g}_l - \left( \bar{\mathbf{g}} + P_g P_g^T (\mathbf{g}_l - \bar{\mathbf{g}}) \right) \right\| \qquad (2)$$

where $\mathbf{g}_l$ is the portion of the image $I$ within the area defined by $\mathbf{l}$ (the current guess). Calculating $\mathbf{g}_l$ requires first calculating $\mathbf{l}$ using $\mathbf{c}$ (in (1)), and then warping this region into the space defined by the mean landmarks $\bar{\mathbf{l}}$. In order to optimize $E$ we use the Levenberg-Marquardt algorithm. The process of tracking results in a set of labeled feature based landmarks per frame. These can then be used to warp each image to the common coordinate frame (the neutral expression), thus achieving dense non-rigid correspondence.
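A hypothetical sketch of this fitting loop using SciPy's Levenberg-Marquardt solver; the `model` container and the `sample_gl` warping helper are our own stand-ins for the learned quantities of (1):

```python
import numpy as np
from scipy.optimize import least_squares

def fit_aam(c0, sample_gl, model):
    """Fit appearance parameters c by Levenberg-Marquardt, comparing the
    patch under the current landmark guess with the model's reconstruction
    of it (Eq. 2). `model` bundles the learned terms of Eq. 1 (l_bar,
    g_bar, Pl, Pg, W, Ql); sample_gl(l) warps the image region under
    landmarks l into the mean-shape frame and returns it as a vector."""
    def residual(c):
        l = model.l_bar + model.Pl @ model.W @ model.Ql @ c   # landmarks (Eq. 1)
        gl = sample_gl(l)                                     # current-guess pixels
        recon = model.g_bar + model.Pg @ (model.Pg.T @ (gl - model.g_bar))
        return gl - recon                                     # Eq. 2 residual
    return least_squares(residual, c0, method='lm').x
```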
4.3. Sampling, Rigid Alignment and Statistical Modeling (Steps 4, 5 and 6)

Given a set of non-rigidly aligned sequences, these are aligned again to a single common coordinate frame. This is selected to be a neutral expression from the full training sequence. The space of aligned images $I_{3D}$ is then uniformly sampled. This sampling defines the topology and density of the facial mesh, recalling that $I_{3D}(\mathbf{u}) = \mathbf{x}$. Since each $I_{3D}$ refers to a different set of 3D points, aligning these and then sampling in a uniform manner results in a unique set of registered 3D meshes. Similarly, there now also exists a common set of faces $\mathbf{F}$ for each mesh.
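Step 4 then reduces to slicing each registered $I_{3D}$ image on a fixed grid; a minimal sketch (the function name and sampling stride are ours):

```python
import numpy as np

def sample_mesh(i3d_registered, step):
    """Step 4: uniformly sample a registered I3D image on a regular grid.
    Because every frame and sequence is sampled with the same grid, all
    resulting meshes share vertex count, ordering and triangulation."""
    grid = i3d_registered[::step, ::step]     # (H/step, W/step, 3)
    return grid.reshape(-1, 3)                # rows are 3D vertex positions
```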
The entire set of 3D mesh data can now be rigidly aligned using Procrustes analysis (see [5] for a detailed description).
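For Step 5, a basic sequential generalized Procrustes pass (rotation and translation only, using the standard SVD solution; iterating to convergence and similarity scaling are omitted). This is a sketch of the technique, not the authors' code:

```python
import numpy as np

def procrustes_align(meshes):
    """Step 5: rigidly align corresponded meshes, (n, m, 3), removing head
    pose. One sequential generalized-Procrustes pass; rotation via the
    standard SVD (Kabsch) solution, translation by centroid removal."""
    ref = meshes[0] - meshes[0].mean(axis=0)
    aligned = []
    for x in meshes:
        xc = x - x.mean(axis=0)               # remove translation
        u, _, vt = np.linalg.svd(xc.T @ ref)  # optimal rotation onto ref
        if np.linalg.det(u @ vt) < 0:         # avoid reflections
            u[:, -1] *= -1
        aligned.append(xc @ (u @ vt))
        ref = np.mean(aligned, axis=0)        # update the reference shape
    return np.asarray(aligned)
```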
Following [2], the registered 3D mesh $X$ and UV texture data $I$ may now be expressed using two PCA models:

$$X' = \bar{X} + P_X \mathbf{b}_X, \qquad I' = \bar{I} + P_I \mathbf{b}_I \qquad (3)$$

where $\bar{X}$ is the mean mesh, $\bar{I}$ is the mean UV image texture, $P_X$ and $P_I$ are the eigenvectors of $X$ and $I$, and $\mathbf{b}_X$ and $\mathbf{b}_I$ are vectors of weights. The eigenvectors of $X$ and $I$ are ordered by the proportion of total variance in the data they represent. Removing some of their columns therefore means that the projected weights $\mathbf{b}_X$ and $\mathbf{b}_I$ can be made much smaller than $X$ and $I$. Rewriting (3) allows us to perform this parameterization to a lower dimensional space:

$$\mathbf{b}_X = P_X^T (X' - \bar{X}), \qquad \mathbf{b}_I = P_I^T (I' - \bar{I}) \qquad (4)$$

This provides a convenient lower dimensional representation for storing dynamic facial movements and performing optimization when fitting the model to new sequences.
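Building the models of (3) and (4) amounts to standard PCA; a compact SVD-based sketch (function names are ours):

```python
import numpy as np

def build_pca_model(samples, n_components):
    """Build a linear PCA model from registered data. `samples` is (N, d):
    rows are flattened meshes X (or UV textures I)."""
    mean = samples.mean(axis=0)
    _, _, vt = np.linalg.svd(samples - mean, full_matrices=False)
    P = vt[:n_components].T                   # basis, ordered by variance
    return mean, P

def project(x, mean, P):                      # Eq. 4: b = P^T (x' - mean)
    return P.T @ (x - mean)

def reconstruct(b, mean, P):                  # Eq. 3: x' = mean + P b
    return mean + P @ b
```

Parameterizing a sequence with project and re-synthesizing it with reconstruct mirrors how the experiments below use (4) and (3).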
5. Experiments

In this section we perform baseline experiments comparing the three registration approaches described in Section 4.2. These are (1) standard optical flow concatenation, which extends [2], (2) a combined optical flow and regularization approach similar to [7, 24], and (3) the new AAM-TPS combination approach proposed in this paper.

For test purposes we selected 8 dynamic AU sequences from our data set consisting of approximately 65 frames each. For optical flow we use the pyramidal Lucas-Kanade (LK) algorithm as in [7]. We first wished to compare how well the AAM and LK algorithms tracked facial feature points versus a ground truth. To create the ground truth we manually annotated each frame from each sequence with landmark points at 47 key facial features around the eyes, nose and mouth. This test gives an indication of how stable points are over time, and whether drift occurs as reported in previous work. For the AAM test, an individual model with 47 landmarks was trained for each sequence using 3 manually selected frames – typically at the beginning, middle and end. Points were manually initialized in frame 1 for both the AAM and LK tests. Table 2 shows the mean Euclidean error between ground truth points and tracked points (in pixels) for each frame. It can be seen that the AAM error is consistently lower than the LK error. Figure 3 shows examples of how the LK error accumulates over the course of tracking, supporting the optical flow drift observations in [6, 7, 24]. This is evidence that the AAM method provides a more stable tracking approach over time, and is a valuable tool for reducing drift.

We next wished to evaluate how well each method performed registration of the image sequences from a qualitative point of view. Figure 4 shows example registrations of peak frames to neutral frames for four sequences using (1) dense concatenated LK flow fields between the peak and neutral frame (see Section 4.2 - Optical Flow), (2) concatenated LK optical flow combined with TPS regularization (see Section 4.2 - Optical Flow and Regularization), and (3) feature points tracked with an AAM and registered using TPS (see Section 4.2 - AAM and TPS).

It can be seen from Figure 4 that the LK method used alone produces noticeable drift artifacts. We observed that this is due to pixels overlapping each other, and is a result of the flow field being concatenated over consecutive neighboring frames. One approach to avoid this in the future may be to add a temporal constraint to the flow calculation which observes learned facial deformations. The LK+TPS method overcomes the drawback of pixel overlap due to (1) the tracked points being initially far apart, and (2) the TPS method regularizing the positions of pixels in between the tracked points. Alignment is much improved over LK alone. However, as highlighted by the red dotted circles, accumulated optical flow drift causes some facial features (such as the lower lip and cheeks) to distort. The AAM-TPS method provides the most stable registration, as demonstrated qualitatively by an absence of drift artifacts and pixel overlaps. We have also used this technique in a
perceptual face experiment [10] and participants reported no visible issues with the model.

Finally, we used the AAM-TPS approach to create a morphable model (see Section 4.3). We parameterized the original sequences using (4) and then re-synthesized them using (3). Figure 5 shows example outputs from the model. In order to show how the mesh deforms smoothly with the tracked facial features, we also show corresponding examples using a UV map of a checkered pattern. The deformations in the pattern clearly demonstrate that the mesh is following the correct facial movement.

Figure 4. Peak frames registered to neutral frames using LK, LK+TPS and AAM+TPS (see Section 4.2). In each case the concatenated sequence information between the peak and neutral frame is used for registration. Red circles highlight drift errors in the LK+TPS approach.

Figure 5. Outputs from a morphable model constructed using the AAM+TPS method: (left to right) Neutral, 9+10+25, 20+23+25, 12+10 and 16+10+25. The checker pattern highlights the underlying mesh deformation.
Figure 3. AAM-TPS (blue line) and Lucas-Kanade (red line) tracking errors (in pixels, per frame) for 4 sequences: 1+2+4+5+20+25, 4+7+17+23, 9+10+25 and 12+10. It can be seen that the optical flow method accumulates error as the sequence moves on, whereas the AAM error remains consistently lower.

AU Sequence       AAM-TPS   LK
1+2+4+5+20+25     18.6      43.1
20+23+25          53.8      63.8
9+10+25           25.4      41.7
18+25             15.3      43.6
16+10+25          23.2      38.8
12+10             15.2      48.7
4+7+17+23          3.4      28.5
1+4+15             2.3      26.2

Table 2. AAM-TPS and Lucas-Kanade mean Euclidean error values (in pixels) for tracked feature points versus ground truth landmark points. 8 dynamic AU sequences were tracked in this particular test. The result demonstrates the improved reliability of the AAM tracking method over the optical flow approach.

6. Conclusion and Future Work

In this paper we have presented the first dynamic 3D FACS data set (D3DFACS) for facial expression research. The corpus is fully FACS coded and contains 10 participants performing a total of 534 AU sequences. We also proposed a framework for building dynamic 3D morphable facial models and described an AAM based approach for non-rigid 3D mesh registration. Our experiments show that the approach has several advantages over optical flow based registration. For future work we wish to perform experiments comparing the performance of dynamic morphable models versus static ones in a series of benchmark tests such as tracking. We would also like to combine model based approaches such as AAMs with optical flow to improve dense feature point registration between the tracked feature points.

Acknowledgements

We would like to thank the Royal Academy of Engineering/EPSRC for funding this work. Also thanks to all the participants in the data set, particularly the FACS experts: Gwenda Simons, Kornelia Gentsch and Michaela Rohr.

References

[1] M. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan. Automatic recognition of facial actions in spontaneous behavior. Journal of Multimedia, 2006. 1
[2] V. Blanz and T. Vetter. A morphable model for the synthesis of 3d faces. In Proc. of ACM SIGGRAPH, 1999. 1, 2, 3, 4, 5, 6
[3] V. Blanz and T. Vetter. Face recognition based on fitting a 3d morphable model. IEEE Trans. Pattern Anal. Mach. Intell., 25:1063–1074, 2003. 1, 2
[4] F. Bookstein. Principal warps: Thin-plate splines and the decomposition of deformations. IEEE Trans. Pattern Anal. Mach. Intell., 11(6):567–585, 1989.
[5] F. Bookstein. Morphometric Tools for Landmark Data: Geometry and Biology. Cambridge Uni. Press, 1991. 4, 6
[6] G. Borshukov, D. Piponi, O. Larsen, J. Lewis, and C. Tempelaar-Lietz. Universal capture - image based facial animation for the matrix reloaded. In ACM SIGGRAPH Sketch, 2003. 2, 5, 6
[7] D. Bradley, W. Heidrich, T. Popa, and A. Sheffer. High resolution passive facial performance capture. ACM Trans. Graph., 29:1–10, 2010. 2, 4, 5, 6
[8] F. R. G. Challenge. http://www.nist.gov/itl/iad/ig/frgc.cfm. 1
[9] T. Cootes, G. Edwards, and C. Taylor. Active appearance models. IEEE Trans. Pattern Anal. Mach. Intell., 23:681–685, 2001. 2, 5, 6
[10] D. Cosker, E. Krumhuber, and A. Hilton. Perception of linear and nonlinear motion properties using a facs validated 3d facial model. In Proc. of ACM Applied Perception in Graphics and Visualisation, pages 101–108, 2010. 7
[11] J. Duncan. The unusual birth of benjamin button. Cinefex, 2009. 1
[12] P. Ekman, W. Friesen, and J. Hager. Facial Action Coding System: Second Edition. Salt Lake City: Research Nexus eBook, 2002. 1, 2, 3, 4
[13] Y. Furukawa and J. Ponce. Dense 3d motion capture for human faces. In Proc. of IEEE Computer Vision and Pattern Recognition (CVPR), pages 1674–1681, 2009. 1, 2
[14] A. Georghiades, P. Belhumeur, and D. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intell., 23(6):643–660, 2001. 1
[15] P. Gosselin, M. Perron, and M. Beaupre. The voluntary control of facial action units in adults. Emotion, 10(2):266–271, 2010. 2
[16] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proc. of Image Understanding Workshop, pages 121–130, 1981. 5
[17] P. Lucey, J. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews. The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression. In Proc. of IEEE Computer Vision and Pattern Recognition (CVPR), pages 94–101, 2010. 1, 2, 6
[18] W. Ma, A. Jones, J. Chiang, T. Hawkins, S. Frederiksen, P. Peers, M. Vukovic, M. Ouhyoung, and P. Debevec. Facial performance synthesis using deformation-driven polynomial displacement maps. ACM Trans. Graph., 27(5):1–10, 2008. 2
[19] M. Pantic, M. Valstar, R. Rademaker, and L. Maat. Fully automatic facial recognition in spontaneous behavior. In Proc. of International Conference on Multimedia and Expo, pages 317–321, 2005. 1
[20] A. Patel and W. Smith. 3d morphable face models revisited. In Proc. of IEEE Computer Vision and Pattern Recognition (CVPR), pages 1327–1334, 2009. 1, 2, 3, 5
[21] D. Systems. http://www.3dmd.com. 3
[22] L. Yin, X. Chen, Y. Sun, T. Worm, and M. Reale. A high-resolution 3d dynamic facial expression database. In Proc. of Int. Conf. on Auto. Face and Gesture Recog., 2008. 1, 2
[23] L. Yin, X. Wei, Y. Sun, J. Wang, and M. Rosato. A 3d facial expression database for facial behavior research. In Proc. of Int. Conf. on Auto. Face and Gesture Recog., 2006. 1
[24] L. Zhang, N. Snavely, B. Curless, and S. Seitz. Spacetime faces: high resolution capture for modeling and animation. ACM Trans. Graph., 23(3):548–558, 2004. 2, 4, 5, 6