Image-based animation of facial expressions

Gideon Moiza¹, Ayellet Tal¹, Ilan Shimshoni², David Barnett¹, Yael Moses³
¹ Department of Electrical Engineering, Technion – IIT, Haifa, 32000, Israel
E-mail: [email protected]
² Department of Industrial Engineering, Technion – IIT, Haifa, 32000, Israel
E-mail: [email protected]
³ Department of Computer Science, Interdisciplinary Center, Herzeliya, Israel
E-mail: [email protected]
Published online: 2 October 2002
c Springer-Verlag 2002
We present a novel technique for creating realistic
facial animations given a small number of real images and a few parameters for the in-between images.
This scheme can also be used for reconstructing facial
movies where the parameters can be automatically extracted from the images. The in-between images are
produced without ever generating a three-dimensional
model of the face. Since facial motion due to expressions is not well defined mathematically, our approach is based on utilizing image patterns in facial
motion. These patterns were revealed by an empirical study which analyzed and compared image motion
patterns in facial expressions. The major contribution
of this work is showing how parameterized “ideal”
motion templates can generate facial movies for different people and different expressions, where the parameters are extracted automatically from the image
sequence. To test the quality of the algorithm, image
sequences (one of which was taken from a TV news
broadcast) were reconstructed, yielding movies hardly
distinguishable from the originals.
Key words: Facial animation – Facial expression reconstruction – Image morphing
Correspondence to: A. Tal
Work has been supported in part by the Israeli Ministry of Industry and Trade, The MOST Consortium
The Visual Computer (2002) 18:445–467
Digital Object Identifier (DOI) 10.1007/s003710100157
1 Introduction

The human face is one of the most complex and interesting objects that we come across on a regular basis.
The face, and the myriad expressions and gestures
that it is capable of making, are a key component
of human interaction and communication. People are
extremely adept at recognizing faces. This attribute
presents both an advantage and a challenge to any
system that manipulates facial images. The viewer
is likely to be able to instantly spot any defects or
shortcomings in the image. If the image is not a perfect rendition of an actual face, both in appearance
and in motion, the user will notice the discrepancies.
Therefore, any facial application needs to be highly
accurate if it is to be successful.
In this paper we explore the issue of constructing images of facial expressions. Our method uses
a small number of full frames. In addition, for each
in-between frame, a small number of points on contours in the image are sufficient to describe the shape
and the location of facial features and to generate
the frames. We will show that our method is capable of generating an image sequence in a faithful and
complete manner.
This method can be used for producing computer
graphics animations, for intelligent compression of
video-conferencing systems and other low-bandwidth video applications, for intelligent man–machine interfaces, and for video databases of facial
images or animations. Though the goal is to produce animations, one plausible way to test the quality
of the technique is to sample a real facial movie,
reconstruct it with our algorithm, and compare the
two movies. This suggests that the same method
can be used for the compression of facial movies.
In this case, the control parameters can be extracted
automatically from the original movies using state-of-the-art tracking systems [e.g., Blake and Isard
(1994); Moses et al. (1995)].
Most of the techniques for facial animations utilized in computer graphics are based on modeling
the three-dimensional structure of a human face and
rendering it using some reflectance properties [for
a survey, see Parke and Waters (1996)]. These techniques are the state-of-the-art for facial animations
and generate extremely compelling results. Geometric interpolation between the facial models is used
in Parke (1972), where the models are digitized by
hand. Measurements of real actors are used in Bergeron and Lachapelle (1985), Essa et al. (1996), and
Williams (1990). The system described in Guenter et
al. (1998) captures the facial expressions in three dimensions and can replay a three-dimensional polygonal face model with a changing texture map. The
process begins with a video of a live actor’s face.
Meshes representing the face are used jointly with
models of the skin and the muscles in Waters (1987),
Terzopoulos and Waters (1990), Lee et al. (1993),
and Lee et al. (1995b). The system of Cassell et al.
(1994) is designed to automatically animate a conversation between human-like agents. Generating
new face geometries automatically, depending on
a mathematical description of possible face geometries, is proposed in DeCarlo et al. (1998). It has been
shown in Pighin et al. (1998) how two-dimensional
morphing techniques can be combined with three-dimensional transformations of a geometric model
to automatically produce three-dimensional facial
expressions.
We propose a different approach which avoids modeling and rendering in three dimensions. It is thus
less expensive. Instead, we use as a basis a set of
real images of the face in question, and we create the
animations in the image domain by producing the in-between images.
In this regard, our work is more related to the image morphing approach, such as described in Wolberg (1990), Beier and Neely (1992), and Lee et
al. (1995a), which has proven to be very effective.
One problem with most of this work, however, is
that the end user is required to specify dozens of
carefully chosen parameters. Moreover, these methods are more appropriate for morphing one object
into another, where the object can be a person.
Since the in-between person is not known, errors
are more tolerable. Our goal, however, is different. We want to generate a movie of a specific
person given a few frames from a movie of this
person.
In Bregler et al. (1997), a related though different
problem is discussed. The mouth regions are morphed in order to lip-synch existing video to a novel
soundtrack. This method is different from ours since
it combines footage from two video sequences, while
we generate the sequence artificially. In Ezzat and
Poggio (1998), an animated talking head system is
described, where morphing techniques are used to
combine visemes taken from an existing corpus. This
method is different from ours since we generate the
visemes artificially. In addition, in Ezzat and Poggio
(1998) the inter-viseme video sequence is generated
by optical flow while we use model-based optical
flow as discussed below.
Related work has also been done in the field of
computer vision, where image-based approaches
are utilized, bypassing three-dimensional models. In
Cootes et al. (1998) a face’s shape and appearance
is described by a large number of parameters. Given
the parameters of a new image, the face can be reconstructed. This approach differs from ours as we use
information from neighboring images in the movie in
order to reconstruct the in-between images. As a result, not only are many fewer parameters needed, but
also the resulting images look sharper as we avoid
warping. In Beymer and Poggio (1996), optical flow
of a given image to a base image is utilized. Both
texture vectors and shape vectors, which form separate linear vector spaces, are used. The advantage of
this approach is that optical flow is done automatically and is well explored. In addition, it is general
and is not model-based. A consequence of this generality is that the resulting images are not as sharp as
the approach we are pursuing, since optical flow is
not always accurate. One way to look at our approach
is as a model-based optical flow, where the model of
the optical flow has been discovered empirically and
is fixed for each expression.
Generally, the animation of faces requires handling
the effects of the viewing position, the illumination conditions, and the facial expressions. There has
been a lot of work done that considers the effects of
viewing position and illumination, mostly in computer vision [for a few examples, see Ullman and
Basri (1991); Moses (1994); Shashua (1992); Seitz and Dyer (1996); Vetter and Poggio
(1997); Sali and Ullman (1998)]. Unlike the changes
in the viewing position, which can be described by
rigid transformations, the nonrigid transformations
determining human expressions are not well defined
mathematically. One way to handle these transformations is to learn from examples the set of transformations of a face while talking or changing expressions.
In this paper we take this avenue while focusing on
these nonrigid transformations. We have been empirically studying image patterns in facial expressions.
Though skin deformation can result from complex
muscle actions and bone motion, our experiments revealed that patterns do exist in images, both
with respect to the same person on different occasions, and also between different people performing
the same expression. Moreover, these patterns can
be expressed in a rather simple way. In addition, our
experiments showed that there is a strong correlation between the motion of regions in a face and the motion of a small number of contours bordering them.

Fig. 1. Sample frames showing dot placement and movement
Thus, the major contribution of this work is showing how, based on empirical results, parameterized
motion templates can generate facial movies. These
parameters are extracted automatically from the image sequence. We show that given very few images
and a few image parameters, the full sequence can
be reconstructed. Our technique is combined with
existing techniques dealing with changes in viewing
position, yielding a system that deals both with head
motion and facial expressions in a faithful manner.
We ran our algorithm on several movies of people
performing various expressions, including a movie
of a TV broadcaster. In order to evaluate the quality
of our algorithm, we compared our artificial movies
to the original movies. The movies were hardly distinguishable. On average, one full frame in fifteen
is needed. If we take into account the other information we use, the compression rate is 1:13. It is
possible to combine our animation algorithm with
standard compression schemes to get higher compression rates. Using our algorithm above MPEG2
reduces the size of the movie by a factor of 4 with
respect to MPEG2. When the algorithm is used
for compression, an accurate tracking system is
required.
The rest of this paper is organized as follows. In
Sect. 2 we describe the empirical experiments we
conducted and draw conclusions. The goal of the experiments was to find image motion patterns that occur during facial expressions. In Sect. 3 we describe
our algorithm for animating facial expressions, using the empirical analysis. In Sect. 4 we present the
experimental results produced by our animation algorithm. We conclude in Sect. 5.
2 Empirical analysis of facial
expressions
In this section we describe our experiments for finding image patterns in facial expressions. The goal of
the empirical research was to find parameters that
can characterize the various expressions and which
can be used to produce animations. Thus, our empirical research was guided by two main objectives. The
first was to see how much variance there was in the
way a person performed the same expression on different runs. Then, we wanted to determine if there
was a correlation between the way different people
perform the same expression.
In order to provide a broad enough sample base, five
subjects were selected, and each expression was repeated five times. Expressions representative of the
types encountered in normal conversational speech
were selected. First, a simple mouth open–close operation was performed, followed by the sounds ‘oh’,
‘oo’, and ‘ee’. A smile and a frown were then performed, as examples of non-speech expressions that
commonly occur during a normal conversation. In
between the execution of subsequent expressions,
the subject returned to a neutral position, with mouth
closed.
In order to track the motion, small uniformly colored stickers were attached to the subjects’ faces.
The stickers were applied in such a way as to maximize the resolution in the areas of the face that were
expected to move the most. The application pattern
can be seen in Fig. 1.
Once the data were collected, several stages of analysis were performed in order to chart the motion of
the dots over time. First, the data points associated
with each sticker were grouped together.

Fig. 2. Vector graphs for the open–close expression for a single subject

Singular value decomposition (SVD) was performed on each
set of points, in order to determine the eigendirection
in which each dot moved. The SVD of a matrix X
yields three matrices, such that
X = UΣV^T.
Σ is a diagonal matrix holding the singular values of
X. The first column of U holds the normalized x and
y components of the directional vector which is the
principal axis of an ellipsoid drawn to best fit the data
points (Duda and Hart 1973). The SVD was used to
obtain a normalized vector indicating the direction in
which the maximum variance of the data points was
found. The lengths of the vectors were determined
by projecting the data points along the directional
eigenvector.
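As a concrete illustration of this computation, the following Python sketch derives the eigendirection and motion extent of a single tracked dot with numpy. The (n_frames × 2) array of tracked positions is an assumed input format; because the samples are stacked as rows here, the principal direction appears as the first right singular vector rather than as the first column of U in the matrix orientation used above.

```python
import numpy as np

def dot_motion_vector(points):
    """Principal motion direction and extent of one tracked sticker.

    points: (n_frames, 2) array of the dot's (x, y) positions over time.
    Returns a unit direction vector (the eigendirection of maximum
    variance) and the length obtained by projecting the data points
    onto that direction, mirroring the analysis described above.
    """
    points = np.asarray(points, dtype=float)
    X = points - points.mean(axis=0)            # centre the trajectory
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    direction = Vt[0]                           # principal axis (unit vector)
    proj = X @ direction                        # signed positions along the axis
    length = proj.max() - proj.min()            # extent of the motion
    return direction, length
```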
The vector data were used to create one graph per
run, per person, per expression. In addition to the
individual graphs, graphs displaying the mean and
standard deviation of the motion per person for each
expression were generated. See, for example, the
graphs generated for the open–close expression for
a single subject in Fig. 2. Finally, the data from all the
runs of each expression were averaged, resulting in
six graphs displaying the overall results across subjects, as shown in Fig. 3. The mean is drawn in blue
and the standard deviation in red. This allowed us to
see if different people perform the same expression
the same way.
After examining the data, several important observations can be made. The most important result is
that clear patterns of motion emerge for the different expressions. The uniformity means that it is
safe to make the assumption that at least within certain bounds, people perform the same expression
the same way. Therefore, algorithms can be developed based on these results, and they should be applicable to the majority of people performing these
expressions.
Variations in the motion were also observed, both
between different runs by the same person as well
as between different subjects. These variations can
be attributed to several sources. First, the subjects
were human, and not robotic automata, and therefore
exhibited a certain degree of individuality. For this
same reason, they did not always perform the same
expression the same way. Variations were largest for
those expressions which allowed the greatest freedom in the way they could be performed. For example, although there is only one way to open and close
the mouth, the ‘oo’ sound can be made with mouth
open or closed, lips pursed or relaxed.
The results can best be described by dealing separately with the various expressions and the characteristics observed in the way the different subjects
performed them.
The graphs provide a fairly intuitive way to understand how the face moved while performing the various expressions. The vector associated with each
sticker is positioned so that its midpoint corresponds
to the center of a bounding box drawn around the
maximum extents of the point’s motion. It should
be noted that the eigendirection vectors describe the
overall dot movement, but do not necessarily provide
a complete picture of the dot’s trajectory over time.
For the graphs involving data averaged over several
runs, three vectors were drawn. The same length,
namely the mean, was used for all three, with the direction changing to illustrate the mean ± standard
deviation. All three vectors are drawn with their centers located at the average midpoint of the various
runs. We summarize our findings for the six expressions below.
• Open–close: Of the expressions, open–close was
the easiest to recognize from the graphs, because
of the large vertical displacement of the points on
the chin. (This motion, however, also made it difficult to track, because the locations of the points
on the chin overlapped as the mouth was opened.)
• OO: Generally, the ‘oo’ expression was characterized by small motions. The corners of the
mouth move in, and the mouth opens slightly.
Fig. 3a–f. Vector graphs: a Vector graph for average open–close; b Vector graph for average Oo; c Vector graph for average Oh; d Vector graph for average Ee; e Vector graph for average smile; f Vector graph for average frown
Some of the subjects barely moved their mouths
at all, others clearly pursed their lips.
• OH: The ‘oh’ was a cross between the open–close and the ‘oo’. The mouth opened, and the edges
of the mouth came slightly together.
• EE: The ‘ee’ can be considered to be the “opposite” of the ‘oo’. The corners of the mouth move
out as opposed to in. Again, the mouth opens
slightly.
• Smile: In the smile expression, the corners of the
mouth move up and out. This was the one expression where the points under the eyes moved
a significant amount, almost straight up and
down.
• Frown: The frown was the expression that the
subjects had the most difficulty with. Different
subjects performed completely different motions
under the guise of ‘frown’. There was also a variance between runs of the same person. In addition, it was the most difficult expression to analyze. Different points tended to move in different
directions, even points in close proximity to each
other.
Ideally, each expression would be performed in
a unique manner, the motion of the points would be
large, linear, and smooth, and the motion would be
symmetrical between the left and right sides of the
face. In practice, however, there were variations between the way the different subjects performed the
expressions. In general, the subjects exhibited various aspects of the following categories. There was
no single person who completely fit a single profile. Nevertheless, such a classification is useful, as it
allows us to focus on the types of difficulties encountered when video of real people is used as input.
• Nonlinear: The first group that deviated from the
ideal were those who exhibited nonlinear motion.
The points follow curved trajectories, or else exhibit hysteresis (i.e., the point does not follow the
same path when closing the mouth as it did when
opened).
• Nonsymmetric: Many tasks would be made easier if everyone moved their faces symmetrically.
If that were the case, then we could compute the
motions from one side of the face and merely reflect them for the other side. Unfortunately, this
was not always the case.
• Inconsistent: All of the subjects were inconsistent
to a certain degree between runs of the same expression. For instance, sometimes the jaw would
shift to the left, other times to the right. In some
cases, the mouth was opened for the ‘oo’, other
times not. Of the expressions, the frown produced
the least consistent results, with sometimes wild
fluctuations from run to run.
One of the most interesting results is how little
the upper part of the face moves during most of
the expressions. (There are other expressions, however, in which the eyebrows move as well.) This
implies that we can cut communication and processing costs by concentrating on the part of the
face that does move, and using a much coarser algorithm for the upper face. These savings could
be achieved by using fewer bits to represent the
data or by updating less frequently. If we are only
interested in handling normal speech, then the region of interest could be constrained to the lower
part of the face. This would be sufficient for applications such as lip reading or low bandwidth
video-conferencing, possibly allowing a higher resolution image to be used with lower communication
requirements.
3 Facial expression animation
The goal of the animation system is to construct a full
sequence of facial images while moving the head,
talking, and changing expressions. The input to the
animation system is a few images of a given face,
a set of image parameters that describe the location
of a few facial features for each of the in-between
frames, and the type of expression. These image parameters can be determined either manually when
a new movie is generated or automatically by a tracking system [e.g., Moses et al. (1995); Black and
Yacoob (1995)] when used for video compression.
The image parameters contain a few control points
(between 10 and 20) on the lip, chin, and eye contours and the pixels that build up the mouth interior.
The output of the animation system is the sequence
of the in-between frames and the movie built out
of it.
The study of pixel movements presented in the previous section yields a set of reconstruction algorithms
for each expression and each facial region. We view
each such algorithm as a motion template parameterized according to the specific face at hand. These
parameters are extracted from the image sequence
as described below. In other words, for each expression we have a function F ( p0 , t) which gives for
each point p0 , p0 ∈ template, at time t, t ∈ [0, 1]
during the expression the “ideal” motion vector of
the frame at time t. The vector is represented by its
length l = ‖F( p0 , t)‖ and its direction N, such that
F ( p0 , t)/l = N.
In practice, however, this function is very complex
to define as one entity. Therefore, the face is partitioned into regions with a common typical motion,
as illustrated in Fig. 4. These regions depend only
upon the given contours and are extracted from them
without user intervention. The motion of the pixels
in the region can be induced from a few parameters.
We denote the regions of the face as D j ⊆ template,
for 1 ≤ j ≤ number_regions.

Fig. 4. a An illustration of the different regions into which the face is divided. These regions are computed automatically from the given lip and chin contours; b The different regions of the news broadcaster's face

For each region we denote by F j ( p0 , t) the function defined on that region,
1 ≤ j ≤ number_regions, p0 ∈ D j , t ∈ [0, 1], such
that
F( p0 , t) = ⋃_j F j ( p0 , t).
The function F should be continuous and differentiable. Special care should be taken to ensure these
properties hold on the boundaries between the regions.
Let the two given static images be Im 0 and Im 1 ,
the in-between image we wish to construct be Im,
and a point p ∈ Im. We are seeking functions f i ( p),
i = 0, 1, over the vector field such that f i ( p) is continuous, differentiable, and

f i ( p) = vi = pi − p,

where pi ∈ Im i , i = 0, 1. That is to say that vi describes the motion of a point p to its position in Im i . Had the image sequence been of an "ideal" subject, we could have derived these functions using F −1 ( p, t Im ), where t Im is the time of Im in the sequence. However, as "ideal" subjects are hard to find, F −1 ( p, t Im ) has to be modified to fit the subject. This is not difficult to do because F is parameterized by a small number of motion vectors for each region, while the motion vectors for all the pixels can be inferred from them. These motion vectors are computed from the given control points on the contours, and the time t Im is computed from the relative distances between the contours of the three images, as will be described below.

The facial expression animation algorithm consists of two main parts: first a rigid transformation is found to compensate for head movements, and then the nonrigid transformation is recovered to deal with facial expressions. Extensive work has been done in the area of recovery of rigid transformations (Black and Yacoob 1995; Blake and Zisserman 1987; Black and Anandan 1996). We will discuss it and describe the method we use later. Our main contribution, however, is in recovering the nonrigid transformations, which do not have simple mathematical descriptions. This method will be presented now.
3.1 Nonrigid transformation recovery
Our method for recovering the nonrigid transformation consists of four stages: contour construction,
contour correspondence, pixel correspondence, and
color interpolation. We elaborate on each of these
stages below.
1. Contour construction: We are given a few points
on the lip, chin, and eye contours (5–10 points on
each contour). These points can be either specified
by the end user or can be recovered automatically
by a tracking system [e.g., Moses et al. (1995)]. Our
algorithm constructs B-splines which follow the contours using the control points. These contours are
extracted both for the in-between image Im and for
the static images Im i , i = 0, 1.
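A minimal sketch of this contour-construction stage is given below, using scipy's spline routines; the cubic degree, the zero smoothing factor, and the number of samples are our own assumptions rather than choices stated in the paper.

```python
import numpy as np
from scipy.interpolate import splprep, splev

def build_contour(control_points, n_samples=200):
    """Fit a B-spline through the given control points (stage 1).

    control_points: (m, 2) array of the 5-10 marked points on a lip,
    chin, or eye contour.  Returns densely sampled contour points and
    their spline parameters, which later stages can use for arc-length
    correspondence.
    """
    pts = np.asarray(control_points, dtype=float)
    # Cubic B-spline through the control points (s=0: no smoothing).
    tck, _ = splprep([pts[:, 0], pts[:, 1]], s=0.0, k=3)
    u = np.linspace(0.0, 1.0, n_samples)
    x, y = splev(u, tck)
    return np.column_stack([x, y]), u
```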
2. Contour correspondence: At this stage a correspondence between points on the contours is established. The correspondence is found between the various contours on the three images – the two static
frames Im i , i = 0, 1, and the in-between image Im
that needs to be constructed.
Let Ci ⊆ Im i , i = 0, 1, be a contour, and let C ⊆ Im be the matching contour. We require that the following equation holds for every point c ∈ C (on the contour): f i (c) + c ∈ Ci . Our correspondence strategy is
based on arc-length parameterization. Arc-length parameterization can effectively handle stretches of the
contours.
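The following sketch shows one way such an arc-length correspondence could be computed for densely sampled contours; the sampling density and the nearest-point lookup are illustrative simplifications.

```python
import numpy as np

def arc_length_params(contour):
    """Normalised arc-length parameter of each sampled contour point."""
    seg = np.linalg.norm(np.diff(contour, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    return s / s[-1]

def corresponding_point(contour_src, contour_dst, point_on_src):
    """Map a point on one contour to the matching point on another
    contour by equating normalised arc length (stage 2)."""
    t_src = arc_length_params(contour_src)
    t_dst = arc_length_params(contour_dst)
    # arc-length parameter of the query point on the source contour
    i = np.argmin(np.linalg.norm(contour_src - point_on_src, axis=1))
    t = t_src[i]
    # sample the destination contour at the same relative arc length
    x = np.interp(t, t_dst, contour_dst[:, 0])
    y = np.interp(t, t_dst, contour_dst[:, 1])
    return np.array([x, y])
```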
3. Pixel correspondence: Given the location of
a pixel p ∈ Im, the objective is to find two corresponding pixels, p0 ∈ Im 0 and p1 ∈ Im 1 , such that
p0 = f 0 ( p) + p and p1 = f 1 ( p) + p. This is done
for every pixel location in image Im. This correspondence is based both upon the templates found
for the various expressions (as described in Sect. 2)
and upon the contour correspondence previously
computed.
We now use D j ⊆ Im to denote the regions of the
face. Similarly, we denote the regions of the face
in the input images as D ji ⊆ Im i , 1 ≤ j ≤ number_regions, i = 0, 1. We require that the following
equation holds for the contours bordering the regions D j :
f i (∂D j ) + ∂D j = ∂D ji .
The above conditions under-constrain the required
function. This is actually the situation in which
optical-flow-based systems operate. The optical flow
of points on the contours can be computed quite
accurately. However, in the regions between the contours the optical flow is estimated very poorly and
therefore the resulting optical flow is an interpolation
of the optical-flow values which were found on the
boundaries.
Therefore we use the empirical knowledge in order to add enough constraints and to guarantee the
correct reconstruction. We have several types of
functions F j (and therefore, several types of functions f j ). The function F j is very simple for the rigid
regions and more complicated for the nonrigid regions (of course a rigid function is a simple special
case of a nonrigid function).
Generally, we proceed in three stages: finding the
locations of the contour points in Im related to the
point p, finding the corresponding contour points in
Im 0 and Im 1 , and finally, finding the corresponding
pixels p0 and p1 .
Note that the pixel correspondence function has to be
continuous both within the regions and on the borders between the various regions. In Sect. 3.2 we will
discuss the function for each region of the face in
detail.
4. Color interpolation: During the previous stage,
two corresponding pixels p0 ∈ Im 0 and p1 ∈ Im 1
were found for a given pixel p in the output frame.
The goal of the current stage is to compute the value
(intensity or color) of p. We found that a linear
interpolation between the two corresponding pixels is sufficient. We compute the distances between
the mouth curve in the in-between image Im to the
mouth curves in the static images Im 0 and Im 1 . The
coefficients used for the linear interpolation are the
relative distances between the mouth curves on the
various images. Let intensity0 (resp. intensity1) be
the value of p0 (resp. p1 ). Let k be the relative distance between the mouth curve in Im to the mouth
curve in Im 0 . The resulting intensity of the pixel is
thus
intensity = (1 − k) × intensity0 + k × intensity1 .
There are various ways to find distances between
contours. We use a very simple scheme which finds
the average distance between the corresponding
points on the curves, using the correspondence found
in stage 2. Other methods for finding distances between curves can be used as well [e.g., Blake and
Isard (1994); Bregler et al. (1997)].
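A compact sketch of this interpolation stage, under the assumption that the mouth contours of the three images are already sampled in correspondence, might look as follows.

```python
import numpy as np

def contour_distance(c_a, c_b):
    """Average distance between corresponding contour points (the simple
    scheme described above); both contours are assumed to be sampled with
    the same arc-length correspondence."""
    return float(np.mean(np.linalg.norm(np.asarray(c_a) - np.asarray(c_b), axis=1)))

def blend_pixel(intensity0, intensity1, mouth_im, mouth0, mouth1):
    """Stage 4: linear blend of the two corresponding pixel values,
    weighted by the relative distance k of Im's mouth curve to Im0's."""
    d0 = contour_distance(mouth_im, mouth0)
    d1 = contour_distance(mouth_im, mouth1)
    k = d0 / (d0 + d1)                     # k = 0 when Im coincides with Im0
    return (1.0 - k) * intensity0 + k * intensity1
```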
3.2 Handling the various regions
Given a pixel p ∈ Im the goal is to find the corresponding pixels p0 ∈ Im 0 and p1 ∈ Im 1 . We will now
show how they are found for each region of the face.
We made an effort to find motion patterns which
could be used in as many facial expressions as possible.

Fig. 5a,b. The chin region: a In the destination image Im; b In the source images Im i , i = 0, 1

In most cases we were able to find a single algorithm that fits all the expressions. The exception
is the region above the mouth where we use two different algorithms, one for smile, open–close and ee
and the other for oo and oh. The reason for that is that
the upper lip gets more contracted in the oo and oh
expressions than in the other expressions.
The chin region: The algorithm for the chin region is
also used for the regions above and below the eyes.
The algorithm has the following steps. We first draw
a vertical line passing through p ∈ Im and find the
intersection points c1 and c2 of this line with the contours of the chin and the lower lip. For each of these
points we find the arc-length parameter ti on the corresponding curve (see Fig. 5a) and use these values
to compute the corresponding points cij on the other
two images (see Fig. 5b). We also find u, the ratio of
the distance from p to c2 with respect to the distance
from c1 to c2 .
We then find the corresponding points pi ∈ Im i corresponding to p ∈ Im utilizing the equation
pi = uci1 + (1 − u)ci2 .
Note that the case where the motion is not completely vertical but somewhat tilted is also handled.
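The following sketch illustrates the chin-region correspondence under the simplifying assumptions that the contours are densely sampled in arc-length correspondence and that the vertical-line intersections can be approximated by the sample nearest in x; it is not the exact implementation used in the system.

```python
import numpy as np

def chin_region_correspondence(p, chin_im, lip_im, chin_i, lip_i):
    """Map a pixel p in the chin region of Im to the matching pixel in Im_i.

    chin_im / lip_im: sampled chin and lower-lip contours of Im.
    chin_i  / lip_i : the corresponding contours of the static image Im_i,
                      sampled so that index j on one matches index j on the other.
    """
    p = np.asarray(p, dtype=float)
    # Intersections of the vertical line through p with the two contours,
    # approximated by the sampled points nearest to p's x-coordinate.
    j1 = np.argmin(np.abs(chin_im[:, 0] - p[0]))   # c1 on the chin contour
    j2 = np.argmin(np.abs(lip_im[:, 0] - p[0]))    # c2 on the lower lip
    c1, c2 = chin_im[j1], lip_im[j2]
    ci1, ci2 = chin_i[j1], lip_i[j2]               # same arc-length indices
    # u: relative position of p between the two contours
    u = np.linalg.norm(p - c2) / max(np.linalg.norm(c1 - c2), 1e-9)
    return u * ci1 + (1.0 - u) * ci2               # p_i = u*c_i1 + (1-u)*c_i2
```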
The cheek region: A cheek region is characterized by
zero motion near the ear which gets larger for points
closer to the chin. The initial angle of the motion vector is between 45◦ and 75◦ , which approaches 90◦
as we approach the chin. Given a point p ∈ Im we
would like to find the angle and length of the motion
vector. We denote by θmin the minimal angle of the motion vector near the ear, and we denote the first three points on the chin contour by ci , i ∈ {0, 1, 2}. θmin is estimated as the average tangent direction at these first
three points. We also find the horizontal ratio u of
the distance from c0 to p with respect to the distance from c to the beginning of the chin region (see
Fig. 6a).
Next we find L max which is the maximal length of the
motion vector between points on Im 0 and Im 1 . This
length is recovered from points on the chin curve on
the border of the chin area.
To find θ and L in Im we use the following equations (see also Fig. 6b):
θ = θmin + (90 − θmin )u
L = L max u.
Finally, we find p0 ∈ Im 0 and p1 ∈ Im 1 using the following equations.
p0 = p − kL(cos θ, sin θ)
p1 = p + (1 − k)L(cos θ, sin θ)
where k denotes the closeness of Im to Im 0 as described above.
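The cheek-region equations translate almost directly into code; in the sketch below, u, θmin, Lmax, and k are assumed to have been computed as described above.

```python
import numpy as np

def cheek_region_correspondence(p, u, theta_min_deg, L_max, k):
    """Map a pixel p in the cheek region of Im to (p0, p1) in Im0 and Im1.

    u            : horizontal ratio of p's position across the cheek region
                   (0 near the ear, 1 near the chin).
    theta_min_deg: minimal motion angle near the ear, estimated from the
                   tangent direction of the first chin-contour points.
    L_max        : maximal motion-vector length, taken from chin points
                   bordering the region.
    k            : relative closeness of Im to Im0 (0 = Im coincides with Im0).
    """
    theta = np.deg2rad(theta_min_deg + (90.0 - theta_min_deg) * u)
    L = L_max * u
    d = np.array([np.cos(theta), np.sin(theta)])
    p = np.asarray(p, dtype=float)
    p0 = p - k * L * d                    # p0 = p - kL(cos t, sin t)
    p1 = p + (1.0 - k) * L * d            # p1 = p + (1-k)L(cos t, sin t)
    return p0, p1
```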
The region above the mouth: For this region two different algorithms are utilized, one for the expressions
smile, ee, and open–close and the other for the expressions oo and oh. We will first describe the former
algorithm. Given p ∈ Im we find the point c1 which
is closest to p on the upper lip contour and its arc-length parameter on the contour t1 . We also find the distance from p to c1 , denoted as L (Fig. 7a). Using t1 we find the point ci1 on the lip contours of Im i . We draw a normal to the curve of length L from ci1 to find pi in Im i , as illustrated in Fig. 7b.

Fig. 6a,b. The cheek region: a In the destination image Im; b In the source images Im i , i = 0, 1
Fig. 7a,b. The region above the mouth in the smile, ee, and open–close expression: a In the destination image Im; b In the source images Im i , i = 0, 1
Fig. 8. The region above the mouth in the oo and oh expressions
This algorithm is not appropriate for the oo and oh
expressions because in these expressions the upper
lip shrinks considerably. Thus we use an algorithm
similar to the one described for the cheeks. Rather
than elaborating on this algorithm we illustrate it
in Fig. 8.
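For completeness, a sketch of the former (smile, ee, and open–close) variant is given below; the finite-difference tangent and the normal-orientation test are our own simplifications.

```python
import numpy as np

def above_mouth_correspondence(p, lip_im, lip_i):
    """Map a pixel p above the mouth (smile / ee / open-close case) to Im_i.

    lip_im, lip_i: sampled upper-lip contours of Im and of Im_i, assumed to
    be in arc-length correspondence (index j on one matches index j on the
    other).  Returns the corresponding pixel location p_i in Im_i.
    """
    p = np.asarray(p, dtype=float)
    lip_im = np.asarray(lip_im, dtype=float)
    lip_i = np.asarray(lip_i, dtype=float)
    j = np.argmin(np.linalg.norm(lip_im - p, axis=1))    # closest lip point c1
    L = np.linalg.norm(p - lip_im[j])                    # distance from p to c1
    # tangent of the Im_i lip contour at the corresponding point c_i1
    t = lip_i[min(j + 1, len(lip_i) - 1)] - lip_i[max(j - 1, 0)]
    t /= max(np.linalg.norm(t), 1e-9)
    n = np.array([-t[1], t[0]])                          # a normal to the contour
    # orient the normal towards the side of the lip on which p lies
    if np.dot(n, p - lip_im[j]) < 0:
        n = -n
    return lip_i[j] + L * n                              # p_i = c_i1 + L * normal
```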
The inner mouth region: One of the most challenging problems is to construct a model of the inner
mouth containing the tongue and the teeth. In this paper we assume we have the image of the inner mouth
whose size is on average 0.5% of the total size of the
face.
The forehead region: Since the forehead hardly deforms, given a pixel p ∈ Im we choose pi ∈ Im i such
that pi = p. This identity transformation is obviously
applied after the rigid transformation which has been applied to the whole face.

Fig. 9. Extracting the eye parameters
Fig. 10. The result of the extraction process: for each of the two eyes three regions are shown: the white of the eye, the iris, and the pupil
The eye region: The eyes exhibit quite complex motions which have to be compensated for by the algorithm. Eyes can open and close, pupils can dilate and
contract, and the eye itself can move within its socket.
We assume that for each image in the sequence
we are given a few points on the contours of the
eyelids, the location of the center of the pupil,
and the radii of the pupil and the iris (a tracking system can supply this information). For each
such sequence of images we construct a canonical image for each eye. Then, when we want to
create the eye in image Im all that is needed is
a small number of parameters and the canonical
image. We will now describe how the canonical
eye image is built and how the eye image of Im is
reconstructed.
The canonical image of the eye is comprised of
three subimages. The white region of the eye is
created from the union of all the white regions of
the eye in the image sequence. This is required
because in each image we see different parts of
the white region due to the motion of the iris and
the opening and closing of the eyelids. Similarly
we compute the union of the iris images and finally we extract the image of the biggest pupil. See
Fig. 9 for an illustration of this process. See Fig. 10
for the results of this algorithm run on an image
sequence.
Suppose now that we are given the canonical image
and we wish to construct the eye of Im. We need to
obtain a few points on the eye contours, the center of
the pupil c0 , the radius of the pupil R1 , and the radius of the iris R2 . Given a point p ∈ Im in the eye
we wish to reconstruct, we determine to which of
the three regions it belongs and copy the color value
from the respective eye component image.
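The reconstruction step can be sketched as a simple region test; the coordinate conventions assumed here (the canonical white image sharing the eye's frame, and the iris and pupil images indexed about their own centres) are illustrative choices, not details given in the paper.

```python
import numpy as np

def reconstruct_eye_pixel(p, c0, R1, R2, white_img, iris_img, pupil_img,
                          iris_centre, pupil_centre):
    """Reconstruct one pixel of the eye region of Im from the canonical images.

    p          : (x, y) location of the pixel inside the eye region of Im.
    c0, R1, R2 : pupil centre, pupil radius and iris radius supplied for Im.
    white_img  : canonical white-of-the-eye image (union over the sequence),
                 assumed to share the eye's own coordinate frame.
    iris_img, pupil_img       : canonical iris and (largest) pupil images.
    iris_centre, pupil_centre : their centres in those canonical images.
    """
    p = np.asarray(p, dtype=float)
    c0 = np.asarray(c0, dtype=float)
    r = np.linalg.norm(p - c0)
    if r <= R1:                                   # pixel belongs to the pupil
        q = np.round(np.asarray(pupil_centre) + (p - c0)).astype(int)
        return pupil_img[q[1], q[0]]
    if r <= R2:                                   # pixel belongs to the iris
        q = np.round(np.asarray(iris_centre) + (p - c0)).astype(int)
        return iris_img[q[1], q[0]]
    q = np.round(p).astype(int)                   # otherwise: white of the eye
    return white_img[q[1], q[0]]
```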
3.3 Rigid transformations
Before we start compensating for the motion due to
expressions we have to compensate for total head
movement. We would like to recover the three-dimensional motion of the head, but as we do not have a three-dimensional model, we are looking for a transformation which will compensate for this movement
as much as possible. We thus assume that the face
is a planar surface and estimate the projective transformation that this plane undergoes from image to
image.
There are two ways to recover this transformation,
one which uses point correspondences to compute
the transformation and another which tries to minimize an error function of the difference in color values in the two images.
We tried to use the first method by choosing four
corresponding points in each image. To get the best
estimate we chose the points to be nearly coplanar.
This method yielded very inaccurate results due to its lack of robustness to small uncertainties in point positions.
We therefore turned to the second method which tries
to find a transformation which minimizes the difference between the transformed image and the other
image.
A projective transformation of a planar point (x, y) in that plane moves it to (x′, y′) = (x + u(x, y), y + v(x, y)), where u and v are given by

u(x, y) = a0 + a1 x + a2 y + a6 x² + a7 xy,
v(x, y) = a3 + a4 x + a5 y + a6 xy + a7 y².
When we are given the parameters ai the transformation can simply be applied. However, when we are
given images whose rigid transformation needs to be
determined, our task is to find a0 · · · a7 such that the
transformed image and the other image are most similar. That is,
E = ∑_{(x,y)∈ℜ²} ρ(I2 (x + u, y + v) − I1 (x, y)),

where I1 (x, y) and I2 (x, y) are the image intensities at pixels (x, y) in the two images, E is the error function, and ρ is a function applied to the error at a pixel. The standard choice for ρ(w) is ρ(w) = w², which is the least-squares error estimate. This choice is not appropriate when the model does not explain the motion completely, as is the case with faces where various other motions are involved (Hampel et al. 1986). We therefore have to choose a more robust function which is not drastically affected by large errors in small parts of the face. We therefore chose to use the Geman and McClure function:

ρ(w, σ) = w² / (σ + w²),

where σ is the control parameter of ρ which determines the tolerance for large errors. The larger σ is, the more tolerant ρ is of large errors.

We minimize the error using the simultaneous over-relaxation method (SOR) (Black and Anandan 1996). This method starts with an initial guess for the parameters ai and, using a gradient descent method, converges to a local minimum of the error function. However, to converge to the global minimum, an initial guess which is close to that minimum should be found. The biggest problem is to find an initial guess for the two-dimensional translation component, which might be quite large, whereas the rotation components are usually quite small and can be estimated as zero. We therefore use a pyramid-based method to estimate it. At first the algorithm is run on a small image which is a scaled-down version of the original image. The result is then used as the initial guess for an image twice as big in both coordinates. This process continues until the algorithm is run on the original image. During the iterative optimization procedure the parameter σ of the function ρ is reduced, causing the optimized function to be less tolerant to errors [graduated non-convexity (Blake and Zisserman 1987)].

The above procedure is used to extract the rigid motion parameters a0 · · · a7 between two images. We apply this procedure twice: once between the input image Im 0 and the image we are constructing Im, and the second time between Im 1 and Im. Using the extracted parameters we transform Im 0 and Im 1 and their respective contours to the head position of Im. The nonrigid transformation described in Sect. 3.1 is applied to the transformed images.
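To make the energy being minimized concrete, the sketch below evaluates the eight-parameter warp and the robust error for a given parameter vector; the nearest-neighbour sampling is a simplification, and the actual minimization (SOR over an image pyramid with a decreasing σ) is only indicated in the comments.

```python
import numpy as np

def warp_coords(x, y, a):
    """Apply the eight-parameter planar motion model to pixel coordinates."""
    u = a[0] + a[1] * x + a[2] * y + a[6] * x ** 2 + a[7] * x * y
    v = a[3] + a[4] * x + a[5] * y + a[6] * x * y + a[7] * y ** 2
    return x + u, y + v

def geman_mcclure(w, sigma):
    """Robust penalty rho(w, sigma) = w^2 / (sigma + w^2)."""
    return w ** 2 / (sigma + w ** 2)

def rigid_alignment_error(a, I1, I2, sigma):
    """E(a) = sum over pixels of rho(I2(x+u, y+v) - I1(x, y), sigma).

    I1, I2 are greyscale images as 2-D float arrays.  The warp is sampled
    with nearest-neighbour lookup for simplicity; the system described above
    would minimise this energy with SOR / gradient descent over an image
    pyramid, shrinking sigma as it goes (graduated non-convexity).
    """
    h, w = I1.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    xw, yw = warp_coords(xs, ys, a)
    xi = np.clip(np.round(xw).astype(int), 0, w - 1)
    yi = np.clip(np.round(yw).astype(int), 0, h - 1)
    residual = I2[yi, xi] - I1
    return float(np.sum(geman_mcclure(residual, sigma)))
```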
4 Experimental results
To test our algorithm, we recorded movies of several
people performing the various expressions. (This
group of people is different from the group of people
whose expressions were analyzed in the empirical
experiments.) Reconstructing in-between images of
the movie is the best way to test the quality of the
algorithm both because we can extract life-like image parameters from the real images and because we
can compare the reconstructed images to the real images from the sequence to evaluate the quality of the
results.
We first tested the algorithm on image sequences in
which actors were asked to perform a single expression. The final experiment was on an image sequence recorded from a television news broadcast. The full sequences can be viewed at http://www.ee.technion.ac.il/~ayellet/facial-reconstruction.html.

Fig. 11. The open–close movie snapshots
Our algorithm was given the first and the last frame
of each expression and the image parameters of the
in-between frames. The parameters of the rigid trans-
formation were extracted automatically. The points
used for computing the facial contours were marked
manually (10–20 pixel locations).
Some of the results are demonstrated below. Figures 11–13 show snapshots from three of the movies
that were generated by our algorithm. Figure 11
shows the snapshots from a movie that generated an open–close expression. Figure 12 shows the snapshots from a movie that generated an oh expression. Figure 13 shows the snapshots from a movie that generated a smile expression.

Fig. 12. The oh movie snapshots
Fig. 13. The smile movie snapshots
To test the quality of our algorithm, we compared the
actual in-between images to the images synthetically
generated by our system. Some of the results are
demonstrated below. For each expression, we show
the two given images, a couple of in-between images
as generated by our system, the actual in-between
images from the real movie, and images that compare the two (i.e., the inverse of the picture generated
by subtracting the colors of the pixels of the real and
synthesized images). Figures 14–18 show a few such
results. An oh expression is illustrated in Fig. 14. An
oo expression is shown in Fig. 15. A smile is demonstrated in Fig. 16. An ee expression is illustrated in
Fig. 17. Finally, the open–close expression is shown
in Fig. 18.
In the final experiment we recorded a movie of
a news broadcaster saying a sentence. The sequence
is 72 frames long (eight words). The sequence is
manually divided into expressions and the algorithm
is applied. Only five full frames are used to reconstruct the full sequence and the results are very good
as can be seen in Figs. 19–22.
Fig. 14a–d. The oh expression: a The original images; b The original in-betweens; c The synthesized in-betweens; d The reverse difference images
Fig. 15a–d. The oo expression: a The original images; b The original in-betweens; c The synthesized in-betweens; d The reverse difference images
Fig. 16a–d. The smile expression: a The original images; b The original in-betweens; c The synthesized in-betweens; d The reverse difference images
Fig. 17a–d. The ee expression: a The original images; b The original in-betweens; c The synthesized in-betweens; d The reverse difference images
Fig. 18a–d. The open–close expression: a The original images; b The original in-betweens; c The synthesized in-betweens; d The reverse difference images
Fig. 19a–d. The oo expression of the news broadcaster: a The original images; b The original in-betweens; c The synthesized in-betweens; d The reverse difference images
Fig. 20a–d. The ee expression of the news broadcaster: a The original images; b The original in-betweens; c The synthesized in-betweens; d The reverse difference images
Fig. 21a–d. The open–close expression of the news broadcaster: a The original images; b The original in-betweens; c The synthesized in-betweens; d The reverse difference images
Fig. 22a–d. The smile expression of the news broadcaster: a The original images; b The original in-betweens; c The synthesized in-betweens; d The reverse difference images

As the results are hardly distinguishable from the original image sequence, our method can also be
used for video compression, if a tracking
system extracts the parameters accurately. In this
case, it is important to know how much storage is required with respect to other compression techniques.
On average, we transfer one full frame in fifteen. For
each of the in-between frames we transfer a few parameters describing the contours (approximately 20
pixel locations), which can be neglected. Moreover,
for each of the in-between frames we transfer the
pixels on the inner mouth, which might vary between
0 pixels (mouth is closed) to 50 × 50 pixels (mouth
is open) for an image of a size of 768 × 576 pixels.
In addition we transfer one image of the eyes (for
the whole sequence) of approximate size 100 × 200.
Thus on average, we get a storage reduction in the
order of 1:13.
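As a rough consistency check, using only the figures quoted above and ignoring the contour points: a block of fifteen 768 × 576 frames costs roughly one full frame (about 442,000 pixels) plus at most fourteen 50 × 50 inner-mouth patches (about 35,000 pixels), against fifteen full frames (about 6.6 million pixels), i.e. a ratio of roughly 1:14; the single 100 × 200 eye image and the frames in which the mouth is wide open bring this to the reported order of 1:13.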
It is important to mention that the full images that
need to be stored or transmitted, as well as the images of the inner mouth of the in-between frames,
can be compressed by any of the well-known compression schemes, on top of our technique. When this
is done, we get a much better storage reduction. For
instance, if standard JPEG is used on top of our proposed technique, assuming a compression by a factor
of 1:30–1:40, we can get a compression by a factor of
about 1:390–1:520 on average.
Finally we compared the compression rate of our
algorithm to that of MPEG2. MPEG2 should work
wonders on this frame sequence since the background does not change and the motions are very
small, which is exactly what MPEG2 exploits in
its compression scheme. We compressed the original sequence using MPEG2 and compared it to our
key frames compressed using MPEG2. The resulting
movie is four times smaller than the original MPEG2
movie. Our compression rate is only 4 and not 13 because the difference between the key frames is larger
than the difference between consecutive frames in
the original movie. We obtained this major improvement over MPEG2 because we exploit our knowledge of the typical motion of the face while undergoing expressions.
5 Conclusion
The human face is one of the most important and
interesting objects encountered in our daily lives.
Animating the human face is an intriguing challenge. Realistic facial animations can have numerous applications, among which are generation of
new scenes in movies, dubbing, video compression,
video-conferencing, and intelligent man–machine
interfaces.
In this paper we have explored the issue of generating movies of facial expressions given only a small
number of frames from a sequence, and very few image parameters for the in-between frames. Since the
change of expressions cannot be described mathematically, we have been empirically studying image
patterns in facial motion. Our experiments revealed that patterns do exist.
The empirical results served as a guide in the creation of a facial animation system. A major contribution of this paper is in showing how an “ideal”
motion stored in a template can be used to generate
animations of real faces, using a few parameters.
The results we achieved are very good. The animations created are highly realistic. In fact, the comparisons between the synthesized images created by
our system and the original in-between images show
that the differences are negligible. When the method
is used for compression, one full image in fifteen is
needed on average, and we get a compression rate of
1:13. In this case an accurate tracking system is required. Combining our scheme with MPEG2 yields
results which are four times better than using only
MPEG2.
There are several possible future directions. First, the
empirical results can be applicable to a wide range
of uses, most notably, expression recognition. Another issue we are exploring is how to build a model
of the inner mouth. The goal is to make it possible
to further reduce the storage and the bandwidth requirements. Third, incorporating illumination models into the system seems an intriguing problem. Currently, drastic changes in illumination are handled
by sending more images as key frames. Finally, we
are planning to use our algorithm in various applications such as distance-learning systems and dubbing
systems.
References

1. Beier T, Neely S (1992) Feature-based image metamorphosis. Proc SIGGRAPH 26(2):35–42
2. Bergeron P, Lachapelle P (1985) Controlling facial expressions and body movements in the computer-animated short “Tony De Peltrie”. In: ACM SIGGRAPH Advanced Computer Animation Seminar Notes, tutorial
3. Beymer D, Poggio T (1996) Image representation for visual learning. Science 272:1905–1909
4. Black MJ, Anandan P (1996) The robust estimation of multiple motions: parametric and piecewise-smooth flow fields. Comput Vis Image Underst 63:75–104
5. Black MJ, Yacoob Y (1995) Tracking and recognizing rigid and non-rigid facial motions using parametric models of image motion. In: Proc. Fifth International Conference on Computer Vision (ICCV ’95). IEEE Computer Society Press, pp 374–381, Cambridge, MA
6. Blake A, Isard M (1994) 3D position, attitude and shape input using video tracking of hands and lips. In: Proc. ACM SIGGRAPH Conference, pp 185–192, Orlando, Florida
7. Blake A, Zisserman A (1987) Visual reconstruction. MIT Press, Cambridge, MA
8. Bregler C, Covell M, Slaney M (1997) Video rewrite: driving visual speech with audio. In: Proc. ACM SIGGRAPH Conference, pp 353–360, Los Angeles, CA
9. Cassell J, Pelachaud C, Badler N, Steedman M, Achorn M, Becket T, Douville B, Prevost S, Stone M (1994) Animated conversation: rule-based generation of facial expression, gesture and spoken intonation for multiple conversational agents. Proc SIGGRAPH 28(2):413–420
10. Cootes TF, Edwards GJ, Taylor CJ (1998) Active appearance models. In: Proc. of European Conference on Computer Vision – ECCV ’98, vol II. Lecture Notes in Computer Science, vol 1407. Springer, Berlin Heidelberg, pp 484–498
11. DeCarlo D, Metaxas D, Stone M (1998) An anthropometric face model using variational techniques. In: Proc. ACM SIGGRAPH Conference, pp 67–74, Orlando, Florida
12. Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, New York
13. Essa I, Basu S, Darrell T, Pentland A (1996) Modeling, tracking and interactive animation of faces and heads using input from video. In: Proc. Computer Animation Conference, pp 68–79, Geneva, Switzerland
14. Ezzat T, Poggio T (1998) MikeTalk: a talking facial display based on morphing visemes. In: Proc. Computer Animation Conference, Philadelphia, Pennsylvania
15. Guenter B, Grimm C, Wood D, Malvar H, Pighin F (1998) Making faces. In: Proc. ACM SIGGRAPH Conference, pp 55–66, Orlando, Florida
16. Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA (1986) Robust statistics: the approach based on influence functions. Wiley, New York
17. Lee Y, Terzopoulos D, Waters K (1993) Constructing physics-based facial models of individuals. In: Proc. Graphics Interface, pp 1–8, Toronto, Canada
18. Lee SY, Chwa KY, Shin SY, Wolberg G (1995a) Image metamorphosis using snakes and free-form deformations. In: Proc. SIGGRAPH ’95, pp 439–448, Los Angeles, CA
19. Lee Y, Terzopoulos D, Waters K (1995b) Realistic modeling for facial animation. In: Proc. ACM SIGGRAPH Conference, pp 55–62, Los Angeles, CA
20. Moses Y (1994) Face recognition: generalization to novel images. Ph.D. thesis, Weizmann Institute of Science
21. Moses Y, Reynard D, Blake A (1995) Determining facial expressions in real time. In: Proc. Fifth International Conference on Computer Vision – ICCV ’95. IEEE Computer Society Press, pp 296–301, Cambridge, MA
22. Parke FI (1972) Computer generated animation of faces. In: Proc. National Conference, Vol 1, pp 451–457, Boston, MA
23. Parke FI, Waters K (1996) Computer facial animation. A K Peters, Wellesley, MA
24. Pighin F, Hecker J, Lischinski D, Szeliski R, Salesin DH (1998) Synthesizing realistic expressions from photographs. In: Proc. SIGGRAPH ’98, pp 75–84, Orlando, Florida
25. Sali E, Ullman S (1998) Recognizing novel 3-D objects under new illumination and viewing position using a small number of example views or even a single view. In: Proc. International Conference on Computer Vision, Bombay, India
26. Seitz SM, Dyer CR (1996) View morphing: synthesizing 3D metamorphosis using image transforms. In: Proc. SIGGRAPH ’96, pp 21–30, New Orleans, Louisiana
27. Shashua A (1992) Illumination and view position in 3D visual recognition. In: Moody J, Hanson JE, Lippman R (eds) Advances in neural information processing systems 4. Morgan Kaufmann, San Mateo, CA, pp 68–74
28. Terzopoulos D, Waters K (1990) Physically-based facial modeling, analysis, and animation. J Vis Comput Anim 1(4):73–80
29. Ullman S, Basri R (1991) Recognition by linear combinations of models. IEEE Trans Pattern Anal Mach Intell 13:992–1005
30. Vetter T, Poggio T (1997) Linear object classes and image synthesis from a single example image. IEEE Trans Pattern Anal Mach Intell 19:733–742
31. Waters K (1987) A muscle model for animating three-dimensional facial expression. In: ACM SIGGRAPH Conference Proceedings, vol 21, pp 17–24, Anaheim, CA
32. Williams L (1990) Performance-driven facial animation. In: ACM SIGGRAPH Conference Proceedings, vol 24, pp 235–242, Dallas, TX
33. Wolberg G (1990) Digital image warping. IEEE Computer Society Press, Los Alamitos
GIDEON MOIZA received his practical engineer degree in 1989 from Bosmat College in Haifa, Israel. He received his B.Sc. degree in Electrical Engineering in 1996 and his M.Sc. degree in Electrical Engineering in 2000, both from the Department of Electrical Engineering at the Technion – Israel Institute of Technology. During his M.Sc. studies he was a teaching assistant in the Electrical Engineering department and conducted research on “Image Based Animation of Facial Expressions” under the supervision of Dr. Ayellet Tal. He currently works in industry in the field of computer networks. His main interests concern computer graphics and computer networks.
AYELLET TAL received her B.Sc. in Mathematics and Computer Science in 1986 and her M.Sc. in Computer Science in 1989, both from Tel-Aviv University. She received her Ph.D. in Computer Science from Princeton University in 1995. Dr. Tal was a Postdoctoral Fellow at the Department of Applied Mathematics at the Weizmann Institute of Science (1995–1997). She joined the Department of Electrical Engineering at the Technion – Israel Institute of Technology in 1997. Her research interests concern computer graphics, computational geometry, animation, scientific visualization and software visualization.
ILAN SHIMSHONI received his B.Sc. in mathematics from the Hebrew University in Jerusalem in 1984, his M.Sc. in computer science from the Weizmann Institute of Science in 1989, and his Ph.D. in computer science from the University of Illinois at Urbana-Champaign (UIUC) in 1995. He was a postdoctoral fellow at the faculty of computer science at the Technion, Israel, from 1995 to 1998, and joined the faculty of industrial engineering and management at the Technion in 1998. His main research interests are in the fields of computer vision, robotics and their applications to other fields such as computer graphics.
DAVID A. BARNETT received his BSEE from Columbia University – School of Engineering and Applied Science in 1994 and his Masters in Electrical Engineering degree from the Technion – Israel Institute of Technology in 1998. He currently lives in New York City with his wife and two cats, and works designing audiovisual and videoconferencing systems and programming control systems. His interests include virtual reality, travel, hiking and cooking. Member IEEE, ACM.
DR. YAEL MOSES received her B.Sc. in mathematics and computer science from the Hebrew University, Jerusalem, Israel, in 1984. She received her M.Sc. and Ph.D. in computer science from the Weizmann Institute of Science, Rehovot, Israel, in 1986 and 1994, respectively. She was a postdoctoral fellow in the Department of Engineering at Oxford University, Oxford, UK, in 1993–1994, and then a postdoctoral fellow at the Weizmann Institute of Science in 1994–1998. Currently, she is a senior lecturer at the Interdisciplinary Center, Herzeliya, Israel. Her main research interests include human vision, computer vision, and applications of computer vision to multimedia systems.