
Image-based animation of facial expressions

2002, The Visual Computer

We present a novel technique for creating realistic facial animations given a small number of real images and a few parameters for the in-between images. This scheme can also be used for reconstructing facial movies, where the parameters can be automatically extracted from the images. The in-between images are produced without ever generating a three-dimensional model of the face. Since facial motion due to expressions is not well defined mathematically, our approach is based on utilizing image patterns in facial motion. These patterns were revealed by an empirical study which analyzed and compared image motion patterns in facial expressions. The major contribution of this work is showing how parameterized "ideal" motion templates can generate facial movies for different people and different expressions, where the parameters are extracted automatically from the image sequence. To test the quality of the algorithm, image sequences (one of which was taken from a TV news broadcast) were reconstructed, yielding movies hardly distinguishable from the originals.

Gideon Moiza (1), Ayellet Tal (1), Ilan Shimshoni (2), David Barnett (1), Yael Moses (3)

(1) Department of Electrical Engineering, Technion – IIT, Haifa, 32000, Israel. E-mail: [email protected]
(2) Department of Industrial Engineering, Technion – IIT, Haifa, 32000, Israel. E-mail: [email protected]
(3) Department of Computer Science, Interdisciplinary Center, Herzeliya, Israel. E-mail: [email protected]

Published online: 2 October 2002. © Springer-Verlag 2002
The Visual Computer (2002) 18:445–467. Digital Object Identifier (DOI) 10.1007/s003710100157

Key words: Facial animation – Facial expression reconstruction – Image morphing
Correspondence to: A. Tal
Work has been supported in part by the Israeli Ministry of Industry and Trade, The MOST Consortium.

1 Introduction

The human face is one of the most complex and interesting objects that we come across on a regular basis. The face, and the myriad expressions and gestures that it is capable of making, are a key component of human interaction and communication. People are extremely adept at recognizing faces. This attribute presents both an advantage and a challenge to any system that manipulates facial images. The viewer is likely to be able to instantly spot any defects or shortcomings in the image. If the image is not a perfect rendition of an actual face, both in appearance and in motion, the user will notice the discrepancies. Therefore, any facial application needs to be highly accurate if it is to be successful. In this paper we explore the issue of constructing images of facial expressions. Our method uses a small number of full frames. In addition, for each in-between frame, a small number of points on contours in the image are sufficient to describe the shape and the location of facial features and to generate the frames. We will show that our method is capable of generating an image sequence in a faithful and complete manner. This method can be used for producing computer graphics animations, for intelligent compression of video-conferencing systems and other low-bandwidth video applications, for intelligent man–machine interfaces, and for video databases of facial images or animations. Though the goal is to produce animations, one plausible way to test the quality of the technique is to sample a real facial movie, reconstruct it with our algorithm, and compare the two movies.
This suggests that the same method can be used for the compression of facial movies. In this case, the control parameters can be extracted automatically from the original movies using state-of-the-art tracking systems [e.g., Blake and Isard (1994); Moses et al. (1995)]. Most of the techniques for facial animations utilized in computer graphics are based on modeling the three-dimensional structure of a human face and rendering it using some reflectance properties [for a survey, see Parke and Waters (1996)]. These techniques are the state-of-the-art for facial animations and generate extremely compelling results. Geometric interpolation between the facial models is used in Parke (1972), where the models are digitized by hand. Measurements of real actors are used in Bergeron and Lachapelle (1985), Essa et al. (1996), and Williams (1990). The system described in Guenter et al. (1998) captures the facial expressions in three dimensions and can replay a three-dimensional polygonal face model with a changing texture map. The process begins with a video of a live actor's face. Meshes representing the face are used jointly with models of the skin and the muscles in Waters (1987), Terzopoulos and Waters (1990), Lee et al. (1993), and Lee et al. (1995b). The system of Cassell et al. (1994) is designed to automatically animate a conversation between human-like agents. Generating new face geometries automatically, depending on a mathematical description of possible face geometries, is proposed in DeCarlo et al. (1998). It has been shown in Pighin et al. (1998) how two-dimensional morphing techniques can be combined with three-dimensional transformations of a geometric model to automatically produce three-dimensional facial expressions. We propose a different approach which avoids modeling and rendering in three dimensions. It is thus less expensive. Instead, we use as a basis a set of real images of the face in question, and we create the animations in the image domain by producing the in-between images. In this regard, our work is more related to the image morphing approach, such as described in Wolberg (1990), Beier and Neely (1992), and Lee et al. (1995a), which has proven to be very effective. One problem with most of this work, however, is that the end user is required to specify dozens of carefully chosen parameters. Moreover, these methods are more appropriate for morphing one object into another, where the object can be a person. Since the in-between person is not known, errors are more tolerable. Our goal, however, is different. We want to generate a movie of a specific person given a few frames from a movie of this person. In Bregler et al. (1997), a related though different problem is discussed. The mouth regions are morphed in order to lip-synch existing video to a novel soundtrack. This method is different from ours since it combines footage from two video sequences, while we generate the sequence artificially. In Ezzat and Poggio (1998), an animated talking head system is described, where morphing techniques are used to combine visemes taken from an existing corpus. This method is different from ours since we generate the visemes artificially. In addition, in Ezzat and Poggio (1998) the inter-viseme video sequence is generated by optical flow, while we use model-based optical flow as discussed below.
Related work has also been done in the field of computer vision, where image-based approaches are utilized, bypassing three-dimensional models. In Cootes et al. (1998) a face's shape and appearance are described by a large number of parameters. Given the parameters of a new image, the face can be reconstructed. This approach differs from ours as we use information from neighboring images in the movie in order to reconstruct the in-between images. As a result, not only are many fewer parameters needed, but also the resulting images look sharper as we avoid warping. In Beymer and Poggio (1996), optical flow of a given image to a base image is utilized. Both texture vectors and shape vectors, which form separate linear vector spaces, are used. The advantage of this approach is that optical flow is done automatically and is well explored. In addition, it is general and is not model-based. A consequence of this generality is that the resulting images are not as sharp as the approach we are pursuing, since optical flow is not always accurate. One way to look at our approach is as a model-based optical flow, where the model of the optical flow has been discovered empirically and is fixed for each expression. Generally, the animation of faces requires handling the effects of the viewing position, the illumination conditions, and the facial expressions. There has been a lot of work done that considers the effects of viewing position and illumination, mostly in computer vision [for a few examples, see Ullman and Basri (1991); Moses (1994); Shashua (1992); Seitz and Dyer (1996); Vetter and Poggio (1997); Sali and Ullman (1998)]. Unlike the changes in the viewing position, which can be described by rigid transformations, the nonrigid transformations determining human expressions are not well defined mathematically. One way to handle these transformations is to learn from examples the set of transformations of a face while talking or changing expressions. In this paper we take this avenue while focusing on these nonrigid transformations. We have been empirically studying image patterns in facial expressions. Though skin deformation can result from complex muscle actions and bone motion, our experimentations revealed that patterns do exist in images, both with respect to the same person on different occasions, and also between different people performing the same expression. Moreover, these patterns can be expressed in a rather simple way.

Fig. 1. Sample frames showing dot placement and movement

In addition, our experimentations showed that there is a strong correlation between the motion of regions in a face and the motion of a small number of contours bordering them. Thus, the major contribution of this work is showing how, based on empirical results, parameterized motion templates can generate facial movies. These parameters are extracted automatically from the image sequence. We show that given very few images and a few image parameters, the full sequence can be reconstructed. Our technique is combined with existing techniques dealing with changes in viewing position, yielding a system that deals both with head motion and facial expressions in a faithful manner. We ran our algorithm on several movies of people performing various expressions, including a movie of a TV broadcaster. In order to evaluate the quality of our algorithm, we compared our artificial movies to the original movies.
The movies were hardly distinguishable. On average, one full frame in fifteen is needed. If we take into account the other information we use, the compression rate is 1:13. It is possible to combine our animation algorithm with standard compression schemes to get higher compression rates. Using our algorithm on top of MPEG2 reduces the size of the movie by a factor of 4 with respect to MPEG2. When the algorithm is used for compression, an accurate tracking system is required. The rest of this paper is organized as follows. In Sect. 2 we describe the empirical experiments we conducted and draw conclusions. The goal of the experiments was to find image motion patterns that occur during facial expressions. In Sect. 3 we describe our algorithm for animating facial expressions, using the empirical analysis. In Sect. 4 we present the experimental results produced by our animation algorithm. We conclude in Sect. 5.

2 Empirical analysis of facial expressions

In this section we describe our experiments for finding image patterns in facial expressions. The goal of the empirical research was to find parameters that can characterize the various expressions and which can be used to produce animations. Thus, our empirical research was guided by two main objectives. The first was to see how much variance there was in the way a person performed the same expression on different runs. Then, we wanted to determine if there was a correlation between the way different people perform the same expression. In order to provide a broad enough sample base, five subjects were selected, and each expression was repeated five times. Expressions representative of the types encountered in normal conversational speech were selected. First, a simple mouth open–close operation was performed, followed by the sounds 'oh', 'oo', and 'ee'. A smile and a frown were then performed, as examples of non-speech expressions that commonly occur during a normal conversation. In between the execution of subsequent expressions, the subject returned to a neutral position, with mouth closed. In order to track the motion, small uniformly colored stickers were attached to the subjects' faces. The stickers were applied in such a way as to maximize the resolution in the areas of the face that were expected to move the most. The application pattern can be seen in Fig. 1. Once the data were collected, several stages of analysis were performed in order to chart the motion of the dots over time.

Fig. 2. Vector graphs for the open–close expression for a single subject

First, the data points associated with each sticker were grouped together. Singular value decomposition (SVD) was performed on each set of points, in order to determine the eigendirection in which each dot moved. The SVD of a matrix X yields three matrices, such that X = UΣV^T. Σ is a diagonal matrix holding the singular values of X. The first column of U holds the normalized x and y components of the directional vector, which is the principal axis of an ellipsoid drawn to best fit the data points (Duda and Hart 1973). The SVD was used to obtain a normalized vector indicating the direction in which the maximum variance of the data points was found. The lengths of the vectors were determined by projecting the data points along the directional eigenvector. The vector data were used to create one graph per run, per person, per expression.
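The eigendirection analysis above translates directly into a few lines of linear algebra. The following is a minimal sketch, assuming each tracked sticker is given as an (N, 2) array of image positions over one run; the function name, the data layout, and the use of the projection span as the vector length are our reading of the description, not part of the original system.

```python
import numpy as np

def principal_motion(points):
    """Dominant motion direction of one tracked dot, as in the SVD analysis
    of Sect. 2: the first right singular vector of the centered positions is
    the principal axis of the ellipsoid best fitting the point cloud."""
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)            # remove the mean position
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]                            # unit eigendirection (x, y)
    proj = centered @ direction                  # positions along that axis
    length = proj.max() - proj.min()             # span of the projections
    return direction, length
```

Averaging the per-run directions and lengths of each sticker across runs and subjects gives the mean and standard-deviation vectors plotted in the graphs of Figs. 2 and 3.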
In addition to the individual graphs, graphs displaying the mean and standard deviation of the motion per person for each expression were generated. See, for example, the graphs generated for the open–close expression for a single subject in Fig. 2. Finally, the data from all the runs of each expression were averaged, resulting in six graphs displaying the overall results across subjects, as shown in Fig. 3. The mean is drawn in blue and the standard deviation in red. This allowed us to see if different people perform the same expression the same way.

Fig. 3a–f. Vector graphs: a Vector graph for average open–close; b Vector graph for average Oo; c Vector graph for average Oh; d Vector graph for average Ee; e Vector graph for average smile; f Vector graph for average frown

After examining the data, several important observations can be made. The most important result is that clear patterns of motion emerge for the different expressions. The uniformity means that it is safe to make the assumption that, at least within certain bounds, people perform the same expression the same way. Therefore, algorithms can be developed based on these results, and they should be applicable to the majority of people performing these expressions. Variations in the motion were also observed, both between different runs by the same person as well as between different subjects. These variations can be attributed to several sources. First, the subjects were human, and not robotic automata, and therefore exhibited a certain degree of individuality. For this same reason, they did not always perform the same expression the same way. Variations were largest for those expressions which allowed the greatest freedom in the way they could be performed. For example, although there is only one way to open and close the mouth, the 'oo' sound can be made with mouth open or closed, lips pursed or relaxed. The results can best be described by dealing separately with the various expressions and the characteristics observed in the way the different subjects performed them. The graphs provide a fairly intuitive way to understand how the face moved while performing the various expressions. The vector associated with each sticker is positioned so that its midpoint corresponds to the center of a bounding box drawn around the maximum extents of the point's motion. It should be noted that the eigendirection vectors describe the overall dot movement, but do not necessarily provide a complete picture of the dot's trajectory over time. For the graphs involving data averaged over several runs, three vectors were drawn. The same length, namely the mean, was used for all three, with the direction changing to illustrate the mean ± standard deviation. All three vectors are drawn with their centers located at the average midpoint of the various runs. We summarize our findings for the six expressions below.

• Open–close: Of the expressions, open–close was the easiest to recognize from the graphs, because of the large vertical displacement of the points on the chin. (This motion, however, also made it difficult to track, because the locations of the points on the chin overlapped as the mouth was opened.)
• OO: Generally, the 'oo' expression was characterized by small motions. The corners of the mouth move in, and the mouth opens slightly. Some of the subjects barely moved their mouths at all; others clearly pursed their lips.
• OH: The 'oh' was a cross between the open–close and the 'oo'. The mouth opened, and the edges of the mouth came slightly together.
• EE: The 'ee' can be considered to be the "opposite" of the 'oo'. The corners of the mouth move out as opposed to in. Again, the mouth opens slightly.
• Smile: In the smile expression, the corners of the mouth move up and out. This was the one expression where the points under the eyes moved a significant amount, almost straight up and down.
• Frown: The frown was the expression that the subjects had the most difficulty with. Different subjects performed completely different motions under the guise of 'frown'. There was also a variance between runs of the same person. In addition, it was the most difficult expression to analyze. Different points tended to move in different directions, even points in close proximity to each other.

Ideally, each expression would be performed in a unique manner, the motion of the points would be large, linear, and smooth, and the motion would be symmetrical between the left and right sides of the face. In practice, however, there were variations between the way the different subjects performed the expressions. In general, the subjects exhibited various aspects of the following categories. There was no single person who completely fit a single profile. Nevertheless, such a classification is useful, as it allows us to focus on the types of difficulties encountered when video of real people is used as input.

• Nonlinear: The first group that deviated from the ideal were those who exhibited nonlinear motion. The points follow curved trajectories, or else exhibit hysteresis (i.e., the point does not follow the same path when closing the mouth as it did when opening it).
• Nonsymmetric: Many tasks would be made easier if everyone moved their faces symmetrically. If that were the case, then we could compute the motions from one side of the face and merely reflect them for the other side. Unfortunately, this was not always the case.
• Inconsistent: All of the subjects were inconsistent to a certain degree between runs of the same expression. For instance, sometimes the jaw would shift to the left, other times to the right. In some cases, the mouth was opened for the 'oo', other times not. Of the expressions, the frown produced the least consistent results, with sometimes wild fluctuations from run to run.

One of the most interesting results is how little the upper part of the face moves during most of the expressions. (There are other expressions, however, in which the eyebrows move as well.)
This implies that we can cut communication and processing costs by concentrating on the part of the face that does move, and using a much coarser algorithm for the upper face. These savings could be achieved by using fewer bits to represent the data or by updating less frequently. If we are only interested in handling normal speech, then the region of interest could be constrained to the lower part of the face. This would be sufficient for applications such as lip reading or low-bandwidth video-conferencing, possibly allowing a higher resolution image to be used with lower communication requirements.

3 Facial expression animation

The goal of the animation system is to construct a full sequence of facial images while moving the head, talking, and changing expressions. The input to the animation system is a few images of a given face, a set of image parameters that describe the location of a few facial features for each of the in-between frames, and the type of expression. These image parameters can be determined either manually, when a new movie is generated, or automatically by a tracking system [e.g., Moses et al. (1995); Black and Yacoob (1995)], when used for video compression. The image parameters contain a few control points (between 10 and 20) on the lip, chin, and eye contours and the pixels that build up the mouth interior. The output of the animation system is the sequence of the in-between frames and the movie built out of it. The study of pixel movements presented in the previous section yields a set of reconstruction algorithms for each expression and each facial region. We view each such algorithm as a motion template parameterized according to the specific face at hand. These parameters are extracted from the image sequence as described below. In other words, for each expression we have a function F(p0, t) which gives, for each point p0 ∈ template and each time t ∈ [0, 1] during the expression, the "ideal" motion vector at time t. The vector is represented by its length l = ‖F(p0, t)‖ and its direction N, such that F(p0, t)/l = N. In practice, however, this function is very complex to define as one entity. Therefore, the face is partitioned into regions with a common typical motion, as illustrated in Fig. 4. These regions depend only upon the given contours and are extracted from them without user intervention. The motion of the pixels in the region can be induced from a few parameters. We denote the regions of the face as Dj ⊆ template, for 1 ≤ j ≤ number_regions. For each region we denote by Fj(p0, t) the function defined on that region, 1 ≤ j ≤ number_regions, p0 ∈ Dj, t ∈ [0, 1], such that

F(p0, t) = ∪j Fj(p0, t).

Fig. 4. a An illustration of the different regions into which the face is divided. These regions are computed automatically from the given lip and chin contours; b The different regions of the news broadcaster's face

The function F should be continuous and differentiable. Special care should be taken to ensure these properties hold on the boundaries between the regions. Let the two given static images be Im0 and Im1, the in-between image we wish to construct be Im, and a point p ∈ Im. We are seeking functions fi(p), i = 0, 1, over the vector field, such that fi(p) is continuous, differentiable, and
fi(p) = vi = pi − p,

where pi ∈ Imi, i = 0, 1. That is to say, vi describes the motion of a point p to its position in Imi. Had the image sequence been of an "ideal" subject, we could have derived these functions using F−1(p, tIm), where tIm is the time of Im in the sequence. However, as "ideal" subjects are hard to find, F−1(p, tIm) has to be modified to fit the subject. This is not difficult to do because F is parameterized by a small number of motion vectors for each region, while the motion vectors for all the pixels can be inferred from them. These motion vectors are computed from the given control points on the contours, and the time tIm is computed from the relative distances between the contours of the three images, as will be described below. The facial expression animation algorithm consists of two main parts: at first a rigid transformation is found to compensate for head movements, and then the nonrigid transformation is recovered to deal with facial expressions. Extensive work has been done in the area of recovery of rigid transformations (Black and Yacoob 1995; Blake and Zisserman 1987; Black and Anandan 1996). We will discuss it and describe the method we use later. Our main contribution, however, is in recovering the nonrigid transformations, which do not have simple mathematical descriptions. This method will be presented now.

3.1 Nonrigid transformation recovery

Our method for recovering the nonrigid transformation consists of four stages: contour construction, contour correspondence, pixel correspondence, and color interpolation. We elaborate on each of these stages below; a short code sketch of the contour-related stages follows the list.

1. Contour construction: We are given a few points on the lip, chin, and eye contours (5–10 points on each contour). These points can be either specified by the end user or recovered automatically by a tracking system [e.g., Moses et al. (1995)]. Our algorithm constructs B-splines which follow the contours using the control points. These contours are extracted both for the in-between image Im and for the static images Imi, i = 0, 1.

2. Contour correspondence: At this stage a correspondence between points on the contours is established. The correspondence is found between the various contours on the three images – the two static frames Imi, i = 0, 1, and the in-between image Im that needs to be constructed. Let a contour Ci ⊆ Imi, i = 0, 1, and the matching contour C ⊆ Im. We require that the following holds for every point c ∈ C (on the contour): fi(c) + c ∈ Ci. Our correspondence strategy is based on arc-length parameterization. Arc-length parameterization can effectively handle stretches of the contours.

3. Pixel correspondence: Given the location of a pixel p ∈ Im, the objective is to find two corresponding pixels, p0 ∈ Im0 and p1 ∈ Im1, such that p0 = f0(p) + p and p1 = f1(p) + p. This is done for every pixel location in image Im. This correspondence is based both upon the templates found for the various expressions (as described in Sect. 2) and upon the contour correspondence previously computed. We now use Dj ⊆ Im to denote the regions of the face. Similarly, we denote the regions of the face in the input images as Dji ⊆ Imi, 1 ≤ j ≤ number_regions, i = 0, 1. We require that the following equation holds for the contours bordering the regions Dj: fi(∂Dj) + ∂Dj = ∂Dji. The above conditions under-constrain the required function.
This is actually the situation in which optical-flow-based systems operate. The optical flow of points on the contours can be computed quite accurately. However, in the regions between the contours the optical flow is estimated very poorly, and therefore the resulting optical flow is an interpolation of the optical-flow values which were found on the boundaries. Therefore we use the empirical knowledge in order to add enough constraints and to guarantee the correct reconstruction. We have several types of functions Fj (and therefore, several types of functions fj). The function Fj is very simple for the rigid regions and more complicated for the nonrigid regions (of course a rigid function is a simple special case of a nonrigid function). Generally, we proceed in three stages: finding the locations of the contour points in Im related to the point p, finding the corresponding contour points in Im0 and Im1, and finally, finding the corresponding pixels p0 and p1. Note that the pixel correspondence function has to be continuous both within the regions and on the borders between the various regions. In Sect. 3.2 we will discuss the function for each region of the face in detail.

4. Color interpolation: During the previous stage, two corresponding pixels p0 ∈ Im0 and p1 ∈ Im1 were found for a given pixel p in the output frame. The goal of the current stage is to compute the value (intensity or color) of p. We found that a linear interpolation between the two corresponding pixels is sufficient. We compute the distances between the mouth curve in the in-between image Im and the mouth curves in the static images Im0 and Im1. The coefficients used for the linear interpolation are the relative distances between the mouth curves on the various images. Let intensity0 (resp. intensity1) be the value of p0 (resp. p1). Let k be the relative distance between the mouth curve in Im and the mouth curve in Im0. The resulting intensity of the pixel is thus

intensity = (1 − k) × intensity0 + k × intensity1.

There are various ways to find distances between contours. We use a very simple scheme which finds the average distance between the corresponding points on the curves, using the correspondence found in stage 2. Other methods for finding distances between curves can be used as well [e.g., Blake and Isard (1994); Bregler et al. (1997)].
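To make the contour-related stages concrete, here is a minimal sketch in Python. It assumes 2D NumPy point arrays and uses SciPy's parametric spline routines; the helper names (fit_contour, relative_distance, blend_colors) are ours and are only meant to illustrate stages 1, 2, and 4, not to reproduce the original implementation.

```python
import numpy as np
from scipy.interpolate import splprep, splev

def fit_contour(control_points, n_samples=200):
    """Stages 1-2: fit a B-spline through the 5-10 marked contour points and
    resample it. splprep's default parameterization is chord-length based, so
    equally spaced parameter values approximate arc-length sampling; the same
    sample index on two resampled contours is treated as a corresponding point."""
    pts = np.asarray(control_points, dtype=float)
    k = min(3, len(pts) - 1)                 # spline degree must be < #points
    tck, _ = splprep([pts[:, 0], pts[:, 1]], s=0, k=k)
    u = np.linspace(0.0, 1.0, n_samples)
    x, y = splev(u, tck)
    return np.stack([x, y], axis=1)          # (n_samples, 2) polyline

def relative_distance(c, c0, c1):
    """Stage 4 coefficient k: closeness of the in-between contour c to c0,
    measured by the average distance between corresponding sample points."""
    d0 = np.linalg.norm(c - c0, axis=1).mean()
    d1 = np.linalg.norm(c - c1, axis=1).mean()
    return d0 / (d0 + d1)

def blend_colors(intensity0, intensity1, k):
    """Stage 4: linear color interpolation between the corresponding pixels."""
    return (1.0 - k) * intensity0 + k * intensity1
```

With k computed once per frame from the mouth contours, blend_colors would be applied to every pair of corresponding pixels found in stage 3.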
3.2 Handling the various regions

Given a pixel p ∈ Im, the goal is to find the corresponding pixels p0 ∈ Im0 and p1 ∈ Im1. We will now show how they are found for each region of the face. We made an effort to find motion patterns which could be used in as many facial expressions as possible. In most cases we were able to find a single algorithm that fits all the expressions. The exception is the region above the mouth, where we use two different algorithms, one for smile, open–close, and ee, and the other for oo and oh. The reason for that is that the upper lip gets more contracted in the oo and oh expressions than in the other expressions.

Fig. 5a,b. The chin region: a In the destination image Im; b In the source images Imi, i = 0, 1

The chin region: The algorithm for the chin region is also used for the regions above and below the eyes. The algorithm has the following steps. We first draw a vertical line passing through p ∈ Im and find the intersection points c1 and c2 of this line with the contours of the chin and the lower lip. For each of these points we find the arc-length parameter ti on the corresponding curve (see Fig. 5a) and use these values to compute the corresponding points cij on the other two images (see Fig. 5b). We also find u, the ratio of the distance from p to c2 with respect to the distance from c1 to c2. We then find the points pi ∈ Imi corresponding to p ∈ Im utilizing the equation

pi = u·ci1 + (1 − u)·ci2.

Note that the case where the motion is not completely vertical but somewhat tilted is also handled.

Fig. 6a,b. The cheek region: a In the destination image Im; b In the source images Imi, i = 0, 1

The cheek region: A cheek region is characterized by zero motion near the ear, which gets larger for points closer to the chin. The initial angle of the motion vector is between 45° and 75°, and it approaches 90° as we approach the chin. Given a point p ∈ Im we would like to find the angle and length of the motion vector. We denote by θmin the minimal angle of the motion vector near the ear, and the first three points on the chin contour as ci, i ∈ {0, 1, 2}. θmin is estimated as the average tangent direction at these first three points. We also find the horizontal ratio u of the distance from c0 to p with respect to the distance from c to the beginning of the chin region (see Fig. 6a). Next we find Lmax, which is the maximal length of the motion vector between points on Im0 and Im1. This length is recovered from points on the chin curve on the border of the chin area. To find θ and L in Im we use the following equations (see also Fig. 6b):

θ = θmin + (90° − θmin)·u
L = Lmax·u.

Finally, we find p0 ∈ Im0 and p1 ∈ Im1 using the following equations:

p0 = p − kL(cos θ, sin θ)
p1 = p + (1 − k)L(cos θ, sin θ),

where k denotes the closeness of Im to Im0, as described above.

Fig. 7a,b. The region above the mouth in the smile, ee, and open–close expressions: a In the destination image Im; b In the source images Imi, i = 0, 1

Fig. 8. The region above the mouth in the oo and oh expressions

The region above the mouth: For this region two different algorithms are utilized, one for the expressions smile, ee, and open–close, and the other for the expressions oo and oh. We will first describe the former algorithm. Given p ∈ Im we find the point c1 which is closest to p on the upper lip contour and its arc-length parameter on the contour t1. We also find the distance from p to c1, denoted as L (Fig. 7a). Using t1 we find the point ci1 on the lip contours of Imi. We draw a normal to the curve of length L from ci1 to find pi in Imi, as illustrated in Fig. 7b. This algorithm is not appropriate for the oo and oh expressions because in these expressions the upper lip shrinks considerably. Thus we use an algorithm similar to the one described for the cheeks. Rather than elaborating on this algorithm, we illustrate it in Fig. 8.

The inner mouth region: One of the most challenging problems is to construct a model of the inner mouth containing the tongue and the teeth. In this paper we assume we have the image of the inner mouth, whose size is on average 0.5% of the total size of the face.

The forehead region: Since the forehead hardly deforms, given a pixel p ∈ Im we choose pi ∈ Imi such that pi = p. This identity transformation is obviously applied after the rigid transformation which has been applied to the whole face.
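The chin and cheek rules above are simple enough to transcribe directly. The following sketch assumes the corresponding contour points and the quantities u, k, θmin, and Lmax have already been computed as described; the function names are ours.

```python
import numpy as np

def chin_correspondence(p, c1, c2, ci1, ci2):
    """Chin-region rule: p lies on a vertical line between the lower-lip point
    c1 and the chin point c2 in Im; ci1, ci2 are the corresponding points on
    the contours of the static image Imi. Returns pi = u*ci1 + (1 - u)*ci2."""
    p, c1, c2 = (np.asarray(v, dtype=float) for v in (p, c1, c2))
    u = np.linalg.norm(p - c2) / np.linalg.norm(c1 - c2)
    return u * np.asarray(ci1) + (1.0 - u) * np.asarray(ci2)

def cheek_correspondence(p, u, theta_min_deg, l_max, k):
    """Cheek-region rule: the motion direction grows from theta_min near the
    ear to 90 degrees near the chin, and its length from 0 to L_max; k is the
    closeness of Im to Im0. Returns the corresponding pixels (p0, p1)."""
    theta = np.deg2rad(theta_min_deg + (90.0 - theta_min_deg) * u)
    step = l_max * u * np.array([np.cos(theta), np.sin(theta)])
    p = np.asarray(p, dtype=float)
    return p - k * step, p + (1.0 - k) * step
```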
Fig. 9. Extracting the eye parameters

Fig. 10. The result of the extraction process: for each of the two eyes three regions are shown: the white of the eye, the iris, and the pupil

The eye region: The eyes exhibit quite complex motions which have to be compensated for by the algorithm. Eyes can open and close, pupils can dilate and contract, and the eye itself can move within its socket. We assume that for each image in the sequence we are given a few points on the contours of the eyelids, the location of the center of the pupil, and the radii of the pupil and the iris (a tracking system can supply this information). For each such sequence of images we construct a canonical image for each eye. Then, when we want to create the eye in image Im, all that is needed is a small number of parameters and the canonical image. We will now describe how the canonical eye image is built and how the eye image of Im is reconstructed. The canonical image of the eye is comprised of three subimages. The white region of the eye is created from the union of all the white regions of the eye in the image sequence. This is required because in each image we see different parts of the white region due to the motion of the iris and the opening and closing of the eyelids. Similarly we compute the union of the iris images, and finally we extract the image of the biggest pupil. See Fig. 9 for an illustration of this process. See Fig. 10 for the results of this algorithm run on an image sequence. Suppose now that we are given the canonical image and we wish to construct the eye of Im. We need to obtain a few points on the eye contours, the center of the pupil c0, the radius of the pupil R1, and the radius of the iris R2. Given a point p ∈ Im in the eye we wish to reconstruct, we determine to which of the three regions it belongs and copy the color value from the respective eye component image.

3.3 Rigid transformations

Before we start compensating for the motion due to expressions, we have to compensate for total head movement. We would like to recover the three-dimensional motion of the head. But as we do not have a three-dimensional model, we are looking for a transformation which will compensate for this movement as much as possible. We thus assume that the face is a planar surface and estimate the projective transformation that this plane undergoes from image to image. There are two ways to recover this transformation: one which uses point correspondences to compute the transformation, and another which tries to minimize an error function of the difference in color values in the two images. We tried to use the first method by choosing four corresponding points in each image. To get the best estimate we chose the points to be nearly coplanar. This method yielded very inaccurate results due to its non-robustness to small uncertainties in point positions. We therefore turned to the second method, which tries to find a transformation that minimizes the difference between the transformed image and the other image. A projective transformation of a planar point (x, y) in that plane moves it to (x′, y′) = (x + u(x, y), y + v(x, y)), where u and v are the following functions:

u(x, y) = a0 + a1x + a2y + a6x² + a7xy
v(x, y) = a3 + a4x + a5y + a6xy + a7y².

When we are given the parameters ai, the transformation can simply be applied.
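Applying the eight-parameter model is straightforward. The sketch below, with function names of our own choosing, evaluates the displacement field and resamples one grayscale image on the other's pixel grid (nearest-neighbour sampling, for brevity); this is the quantity I2(x + u, y + v) that appears in the error function described next.

```python
import numpy as np

def quadratic_flow(a, x, y):
    """Eight-parameter motion model of Sect. 3.3: given a = (a0, ..., a7) and
    pixel coordinates, return the displacement (u, v), so that a point (x, y)
    moves to (x + u, y + v)."""
    a0, a1, a2, a3, a4, a5, a6, a7 = a
    u = a0 + a1 * x + a2 * y + a6 * x**2 + a7 * x * y
    v = a3 + a4 * x + a5 * y + a6 * x * y + a7 * y**2
    return u, v

def warp_image(image, a):
    """Resample a grayscale image at the displaced coordinates, i.e. produce
    I(x + u(x, y), y + v(x, y)) on the original pixel grid."""
    h, w = image.shape
    y, x = np.mgrid[0:h, 0:w].astype(float)
    u, v = quadratic_flow(a, x, y)
    xs = np.clip(np.round(x + u), 0, w - 1).astype(int)
    ys = np.clip(np.round(y + v), 0, h - 1).astype(int)
    return image[ys, xs]
```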
However, when we are given images whose rigid transformation needs to be determined, our task is to find a0, ..., a7 such that the transformed image and the other image are most similar. That is,

E = Σ(x, y)∈ℜ² ρ(I2(x + u, y + v) − I1(x, y)),

where I1(x, y) and I2(x, y) are the image intensities at pixels (x, y) in the two images, E is the error function, and ρ is a function applied to the error at a pixel. The standard choice for ρ(w) is ρ(w) = w², which is the least-squares error estimate. This choice is not appropriate when the model does not explain the motion completely, as is the case with faces where various other motions are involved (Hampel et al. 1986). We therefore have to choose a more robust function which is not drastically affected by large errors in small parts of the face. We therefore chose to use the Geman and McClure function:

ρ(w, σ) = w² / (σ + w²),

where σ is the control parameter of ρ which determines the tolerance for large errors. The larger σ is, the more tolerant ρ is of large errors. We minimize the error using the simultaneous overrelaxation method (SOR) (Black and Anandan 1996). This method starts with an initial guess for the parameters ai and, using a gradient descent method, converges to a local minimum of the error function. However, to converge to the global minimum, an initial guess which is close to that minimum should be found. The biggest problem is to find an initial guess for the two-dimensional translation component, which might be quite large, whereas the rotation components are usually quite small and can be estimated as zero. We therefore use a pyramid-based method to estimate it. At first the algorithm is run on a small image which is a scaled-down version of the original image. The result is then used as the initial guess for an image twice as big in both coordinates. This process continues until the algorithm is run on the original image. During the iterative optimization procedure the parameter σ of the function ρ is reduced, causing the optimized function to be less tolerant to errors [graduated non-convexity (Blake and Zisserman 1987)]. The above procedure is used to extract the rigid motion parameters a0, ..., a7 between two images. We apply this procedure twice: once between the input image Im0 and the image we are constructing Im, and the second time between Im1 and Im. Using the extracted parameters we transform Im0 and Im1 and their respective contours to the head position of Im. The nonrigid transformation described in Sect. 3.1 is applied to the transformed images.
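The following sketch shows the robust objective and a coarse-to-fine loop in the spirit of the description above. It reuses quadratic_flow and warp_image from the previous sketch, assumes grayscale images, and substitutes a generic SciPy minimizer for the SOR scheme of Black and Anandan (1996); the level-to-level parameter rescaling and the σ schedule are our own simplifications, not the paper's exact procedure.

```python
import numpy as np
from scipy.optimize import minimize

def geman_mcclure(w, sigma):
    """rho(w, sigma) = w^2 / (sigma + w^2): large residuals saturate, so a few
    badly modelled pixels (e.g. the moving mouth) cannot dominate the fit."""
    w2 = w * w
    return w2 / (sigma + w2)

def robust_error(a, im1, im2, sigma):
    """E = sum over pixels of rho(I2(x + u, y + v) - I1(x, y), sigma)."""
    residual = warp_image(im2, a).astype(float) - im1.astype(float)
    return geman_mcclure(residual, sigma).sum()

def estimate_rigid(im1, im2, levels=3, sigma=1000.0):
    """Coarse-to-fine estimation of a0..a7, starting from zero motion."""
    pyramid = [(im1, im2)]
    for _ in range(levels - 1):                  # crude 2x downscaling
        im1, im2 = im1[::2, ::2], im2[::2, ::2]
        pyramid.append((im1, im2))
    a = np.zeros(8)
    for level, (s1, s2) in enumerate(reversed(pyramid)):
        if level > 0:
            a[[0, 3]] *= 2.0                     # translations double ...
            a[[6, 7]] *= 0.5                     # ... quadratic terms halve
        a = minimize(robust_error, a, args=(s1, s2, sigma),
                     method="Powell").x
        sigma *= 0.5                             # graduated non-convexity
    return a
```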
4 Experimental results

To test our algorithm, we recorded movies of several people performing the various expressions. (This group of people is different from the group of people whose expressions were analyzed in the empirical experiments.) Reconstructing in-between images of the movie is the best way to test the quality of the algorithm, both because we can extract life-like image parameters from the real images and because we can compare the reconstructed images to the real images from the sequence to evaluate the quality of the results. We first tested the algorithm on image sequences in which actors were asked to perform a single expression. The final experiment was on an image sequence recorded from a television news broadcast. The full sequences can be viewed at http://www.ee.technion.ac.il/∼ayellet/facial-reconstruction.html.

Fig. 11. The open–close movie snapshots

Our algorithm was given the first and the last frame of each expression and the image parameters of the in-between frames. The parameters of the rigid transformation were extracted automatically. The points used for computing the facial contours were marked manually (10–20 pixel locations). Some of the results are demonstrated below. Figures 11–13 show snapshots from three of the movies that were generated by our algorithm. Figure 11 shows the snapshots from a movie that generated an open–close expression. Figure 12 shows the snapshots from a movie that generated an oh expression. Figure 13 shows the snapshots from a movie that generated a smile expression.

Fig. 12. The oh movie snapshots

Fig. 13. The smile movie snapshots

To test the quality of our algorithm, we compared the actual in-between images to the images synthetically generated by our system. Some of the results are demonstrated below. For each expression, we show the two given images, a couple of in-between images as generated by our system, the actual in-between images from the real movie, and images that compare the two (i.e., the inverse of the picture generated by subtracting the colors of the pixels of the real and synthesized images). Figures 14–18 show a few such results. An oh expression is illustrated in Fig. 14. An oo expression is shown in Fig. 15. A smile is demonstrated in Fig. 16. An ee expression is illustrated in Fig. 17. Finally, the open–close expression is shown in Fig. 18. In the final experiment we recorded a movie of a news broadcaster saying a sentence. The sequence is 72 frames long (eight words). The sequence is manually divided into expressions and the algorithm is applied. Only five full frames are used to reconstruct the full sequence, and the results are very good, as can be seen in Figs. 19–22.

Fig. 14a–d. The oh expression: a The original images; b The original in-betweens; c The synthesized in-betweens; d The reverse difference images

Fig. 15a–d. The oo expression: a The original images; b The original in-betweens; c The synthesized in-betweens; d The reverse difference images

Fig. 16a–d. The smile expression: a The original images; b The original in-betweens; c The synthesized in-betweens; d The reverse difference images

Fig. 17a–d. The ee expression: a The original images; b The original in-betweens; c The synthesized in-betweens; d The reverse difference images

Fig. 18a–d. The open–close expression: a The original images; b The original in-betweens; c The synthesized in-betweens; d The reverse difference images

Fig. 19a–d. The oo expression of the news broadcaster: a The original images; b The original in-betweens; c The synthesized in-betweens; d The reverse difference images

Fig. 20a–d. The ee expression of the news broadcaster: a The original images; b The original in-betweens; c The synthesized in-betweens; d The reverse difference images
Fig. 21a–d. The open–close expression of the news broadcaster: a The original images; b The original in-betweens; c The synthesized in-betweens; d The reverse difference images

Fig. 22a–d. The smile expression of the news broadcaster: a The original images; b The original in-betweens; c The synthesized in-betweens; d The reverse difference images

As the results are hardly distinguishable from the original image sequence, our method can also be used for video compression, if a tracking system extracts the parameters accurately. In this case, it is important to know how much storage is required with respect to other compression techniques. On average, we transfer one full frame in fifteen. For each of the in-between frames we transfer a few parameters describing the contours (approximately 20 pixel locations), which can be neglected. Moreover, for each of the in-between frames we transfer the pixels of the inner mouth, which might vary between 0 pixels (mouth is closed) and 50 × 50 pixels (mouth is open) for an image of size 768 × 576 pixels. In addition, we transfer one image of the eyes (for the whole sequence) of approximate size 100 × 200. Thus, on average, we get a storage reduction on the order of 1:13. It is important to mention that the full images that need to be stored or transmitted, as well as the images of the inner mouth of the in-between frames, can be compressed by any of the well-known compression schemes, on top of our technique. When this is done, we get a much better storage reduction. For instance, if standard JPEG is used on top of our proposed technique, assuming a compression by a factor of 1:30–1:40, we can get a compression by a factor of about 1:390–1:520 on average. Finally, we compared the compression rate of our algorithm to that of MPEG2. MPEG2 should work wonders on this frame sequence, since the background does not change and the motions are very small, which is exactly what MPEG2 exploits in its compression scheme. We compressed the original sequence using MPEG2 and compared it to our key frames compressed using MPEG2. The resulting movie is four times smaller than the original MPEG2 movie. Our compression rate is only 4 and not 13 because the difference between the key frames is larger than the difference between consecutive frames in the original movie. We obtained this major improvement over MPEG2 because we exploit our knowledge of the typical motion of the face while undergoing expressions.
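As a rough sanity check of the quoted 1:13 figure, the numbers in this section can be plugged into a few lines of arithmetic. The assumption that the inner-mouth patch is, on average, about half of its maximal 50 × 50 size is ours; all other figures come from the text.

```python
# Per 15 frames, in pixels, using the figures quoted above.
frame_px      = 768 * 576              # one raw 768 x 576 frame
raw_segment   = 15 * frame_px          # fifteen raw frames

full_frame    = frame_px               # one key frame transmitted as-is
mouth_patches = 14 * (50 * 50) // 2    # inner mouth, assumed ~half-open on average
contour_pts   = 14 * 20 * 2            # ~20 (x, y) control points per in-between frame
eye_image     = (100 * 200) // 15      # one 100 x 200 eye image, amortized

ours = full_frame + mouth_patches + contour_pts + eye_image
print(f"approximate storage reduction 1:{raw_segment / ours:.0f}")  # prints ~1:14
```

The result lands near the 1:13 quoted above; the exact figure depends on how often the mouth is open.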
5 Conclusion

The human face is one of the most important and interesting objects encountered in our daily lives. Animating the human face is an intriguing challenge. Realistic facial animations can have numerous applications, among which are generation of new scenes in movies, dubbing, video compression, video-conferencing, and intelligent man–machine interfaces. In this paper we have explored the issue of generating movies of facial expressions given only a small number of frames from a sequence, and very few image parameters for the in-between frames. Since the change of expressions cannot be described mathematically, we have been empirically studying image patterns in facial motion. Our experimentations revealed that patterns do exist. The empirical results served as a guide in the creation of a facial animation system. A major contribution of this paper is in showing how an "ideal" motion stored in a template can be used to generate animations of real faces, using a few parameters. The results we achieved are very good. The animations created are highly realistic. In fact, the comparisons between the synthesized images created by our system and the original in-between images show that the differences are negligible. When the method is used for compression, one full image in fifteen is needed on average, and we get a compression rate of 1:13. In this case an accurate tracking system is required. Combining our scheme with MPEG2 yields results which are four times better than using only MPEG2. There are several possible future directions. First, the empirical results can be applicable to a wide range of uses, most notably expression recognition. Another issue we are exploring is how to build a model of the inner mouth. The goal is to make it possible to further reduce the storage and the bandwidth requirements. Third, incorporating illumination models into the system seems an intriguing problem. Currently, drastic changes in illumination are handled by sending more images as key frames. Finally, we are planning to use our algorithm in various applications such as distance-learning systems and dubbing systems.

References

1. Beier T, Neely S (1992) Feature-based image metamorphosis. Proc SIGGRAPH 26(2):35–42
2. Bergeron P, Lachapelle P (1985) Controlling facial expressions and body movements in the computer-animated short "Tony De Peltrie". In: ACM SIGGRAPH Advanced Computer Animation Seminar Notes, tutorial
3. Beymer D, Poggio T (1996) Image representation for visual learning. Science 272:1905–1909
4. Black MJ, Anandan P (1996) The robust estimation of multiple motions: parametric and piecewise-smooth flow fields. Comput Vis Image Underst 63:75–104
5. Black MJ, Yacoob Y (1995) Tracking and recognizing rigid and non-rigid facial motions using parametric models of image motion. In: Proc. Fifth International Conference on Computer Vision (ICCV '95), pp 374–381, IEEE Computer Society Press, Cambridge, MA
6. Blake A, Isard M (1994) 3D position, attitude and shape input using video tracking of hands and lips. In: Proc. ACM SIGGRAPH Conference, pp 185–192, Orlando, Florida
7. Blake A, Zisserman A (1987) Visual reconstruction. MIT Press, Cambridge, MA
8. Bregler C, Covell M, Slaney M (1997) Video rewrite: driving visual speech with audio. In: Proc. ACM SIGGRAPH Conference, pp 353–360, Los Angeles, CA
9. Cassell J, Pelachaud C, Badler N, Steedman M, Achorn B, Becket T, Douville B, Prevost S, Stone M (1994) Animated conversation: rule-based generation of facial expression, gesture and spoken intonation for multiple conversational agents. Proc SIGGRAPH 28(2):413–420
10. Cootes TF, Edwards GJ, Taylor CJ (1998) Active appearance models. In: Proc. of European Conference on Computer Vision – ECCV '98, vol II. Lecture Notes in Computer Science, vol 1407. Springer, Berlin Heidelberg, pp 484–498
11. DeCarlo D, Metaxas D, Stone M (1998) An anthropometric face model using variational techniques. In: Proc. ACM SIGGRAPH Conference, pp 67–74, Orlando, Florida
12. Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, New York
13. Essa I, Basu S, Darrell T, Pentland A (1996) Modeling, tracking and interactive animation of faces and heads using input from video. In: Proc. Computer Animation Conference, pp 68–79, Geneva, Switzerland
14. Ezzat T, Poggio T (1998) MikeTalk: a talking facial display based on morphing visemes. In: Proc. Computer Animation Conference, Philadelphia, Pennsylvania
15. Guenter B, Grimm C, Wood D, Malvar H, Pighin F (1998) Making faces. In: Proc. ACM SIGGRAPH Conference, pp 55–66, Orlando, Florida
16. Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA (1986) Robust statistics: the approach based on influence functions. Wiley, New York
17. Lee Y, Terzopoulos D, Waters K (1993) Constructing physics-based facial models of individuals. In: Proc. Graphics Interface, pp 1–8, Toronto, Canada
18. Lee SY, Chwa KY, Shin SY, Wolberg G (1995a) Image metamorphosis using snakes and free-form deformations. Proc. SIGGRAPH '95, pp 439–448, Los Angeles, CA
19. Lee Y, Terzopoulos D, Waters K (1995b) Realistic modeling for facial animation. In: ACM SIGGRAPH, pp 55–62, Los Angeles, CA
20. Moses Y (1994) Face recognition: generalization to novel images. Ph.D. thesis, Weizmann Institute of Science
21. Moses Y, Reynard D, Blake A (1995) Determining facial expressions in real time. In: Proc. Fifth International Conference on Computer Vision – ICCV '95. IEEE Computer Society Press, pp 296–301, Cambridge, MA
22. Parke FI (1972) Computer generated animation of faces. In: Proc. National Conference, vol 1, pp 451–457, Boston, MA
23. Parke FI, Waters K (1996) Computer facial animation. A K Peters, Wellesley, MA
24. Pighin F, Hecker J, Lischinski D, Szeliski R, Salesin DH (1998) Synthesizing realistic expressions from photographs. In: Proc. SIGGRAPH '98, pp 75–84, Orlando, Florida
25. Sali E, Ullman S (1998) Recognizing novel 3-D objects under new illumination and viewing position using a small number of example views or even a single view. In: Proc. International Conference on Computer Vision, Bombay, India
26. Seitz SM, Dyer CR (1996) View morphing: synthesizing 3D metamorphosis using image transforms. In: Proc. SIGGRAPH '96, pp 21–30, New Orleans, Louisiana
27. Shashua A (1992) Illumination and view position in 3D visual recognition. In: Moody J, Hanson JE, Lippman R (eds) Advances in neural information processing systems 4. Morgan Kaufmann, San Mateo, CA, pp 68–74
28. Terzopoulos D, Waters K (1990) Physically-based facial modeling, analysis, and animation. J Vis Comput Anim 1(4):73–80
29. Ullman S, Basri R (1991) Recognition by linear combinations of models. IEEE Trans Pattern Anal Mach Intell 13:992–1005
30. Vetter T, Poggio T (1997) Linear object classes and image synthesis from a single example image. IEEE Trans Pattern Anal Mach Intell 19:733–742
31. Waters K (1987) A muscle model for animating three-dimensional facial expression. In: ACM SIGGRAPH Conference Proceedings, vol 21, pp 17–24, Anaheim, CA
32. Williams L (1990) Performance-driven facial animation. In: ACM SIGGRAPH Conference Proceedings, vol 24, pp 235–242, Dallas, TX
33. Wolberg G (1990) Digital image warping. IEEE Computer Society Press, Los Alamitos

GIDEON MOIZA received his practical engineer degree in 1989 from Bosmat College in Haifa, Israel. He received his B.Sc. degree in Electrical Engineering in 1996 and his M.Sc. degree in Electrical Engineering in 2000, both from the Department of Electrical Engineering, Technion – Israel Institute of Technology. During his M.Sc. studies he was a teaching assistant in the Electrical Engineering department and conducted research on "Image Based Animation of Facial Expressions" under the supervision of Dr. Ayellet Tal. He currently works in industry in the field of computer networks. His main interests concern computer graphics and computer networks.

AYELLET TAL received the B.Sc.
in Mathematics and Computer Science in 1986, and her M.Sc. in Computer Science in 1989, both from Tel-Aviv University. She received her Ph.D. in Computer Science from Princeton University in 1995. Dr. Tal was a Postdoctoral Fellow at the Department of Applied Mathematics at the Weizmann Institute of Science (1995–1997). She joined the Department of Electrical Engineering at the Technion – Israel Institute of Technology in 1997. Her research interests concern computer graphics, computational geometry, animation, scientific visualization and software visualization.

ILAN SHIMSHONI received the B.Sc. in mathematics from the Hebrew University in Jerusalem in 1984, his M.Sc. in computer science from the Weizmann Institute of Science in 1989, and his Ph.D. in computer science from the University of Illinois at Urbana-Champaign (UIUC) in 1995. He was a postdoctoral fellow at the faculty of computer science at the Technion – Israel Institute of Technology from 1995 to 1998, and joined the faculty of industrial engineering and management at the Technion in 1998. His main research interests are in the fields of computer vision, robotics and their applications to other fields such as computer graphics.

DAVID A. BARNETT received a BSEE from Columbia University – School of Engineering and Applied Science in 1994, and a Master's degree in Electrical Engineering from the Technion – Israel Institute of Technology in 1998. He currently lives in New York City with his wife and two cats, and works on designing audiovisual and videoconferencing systems and programming control systems. His interests include virtual reality, travel, hiking, and cooking. He is a member of the IEEE and the ACM.

DR. YAEL MOSES received her B.Sc. in mathematics and computer science from the Hebrew University, Jerusalem, Israel, in 1984. She received her M.Sc. and Ph.D. in computer science from the Weizmann Institute of Science, Rehovot, Israel, in 1986 and 1994, respectively. She was a postdoctoral fellow in the Department of Engineering at Oxford University, Oxford, UK, in 1993–1994. She then was a postdoctoral fellow at the Weizmann Institute of Science in 1994–1998. Currently, she is a senior lecturer at the Interdisciplinary Center, Herzeliya, Israel. Her main research interests include human vision, computer vision, and applications of computer vision to multimedia systems.