Representations of Human Faces: Max-Planck-Institut für biologische Kybernetik


Max-Planck-Institut

für biologische Kybernetik


Arbeitsgruppe Bülthoff
Spemannstraße 38 • 72076 Tübingen • Germany

Technical Report No. 41 October 16, 1996

Representations of human faces


Nikolaus F. Troje and Thomas Vetter

Abstract

Several models for parameterized face representations have been proposed in recent years.
A simple coding scheme treats the image of a face as a long vector with each entry
coding for the intensity of one single pixel in the image (e.g. Sirovich & Kirby 1987).
Although simple and straightforward, such pixel-based representations have several
disadvantages. We propose a representation for images of faces that separates texture and 2D
shape by exploiting pixel-by-pixel correspondence between the images. The advantages of
this representation compared to pixel-based representations are demonstrated by means of
the quality of low-dimensional reconstructions derived from principal component analysis
and by means of the performance that a simple linear classifier can achieve for sex
classification.

This document is available as /pub/mpi-memos/TR-041.ps via anonymous ftp from ftp.mpik-tueb.mpg.de or
from the World Wide Web, http://www.mpik-tueb.mpg.de/projects/TechReport/head.html

1 Introduction

Few object classes have been examined as extensively as the class of human faces.
Investigations have been carried out in several different scientific disciplines. Psychologists
and human-ethologists are interested in the way our perceptual system deals with faces when
performing tasks such as recognizing individual persons or rating gender, age, attractiveness
or facial expression. Computer scientists work with human faces in different areas. In machine
vision, much effort has been put into constructing artificial face recognition systems that are
able to generalize between different appearances of the same face. In computer graphics, faces
play an important role for modelling and animation purposes. Faces also make interesting
objects for the study of efficient coding schemes, as this information is relevant for video
conferencing and telecommunication.

Most of the problems that have to be solved in face recognition are shared by other visual
object recognition tasks. In the following paragraphs, we will discuss these problems in more
general terms, speaking about “object recognition” rather than about “face recognition”. We
will come back to human faces, however, when discussing concrete examples for different
representations.

The input to our brain consists of the retinal images. Artificial systems usually also have to
rely on two-dimensional images. Object recognition can be described as the process of finding
an appropriate measure for the distance between stored representations and an incoming image.
If the task is object identification, then such a measure should provide a relatively small
distance between views of the same objects, regardless of orientation, illumination and other
scene attributes not related to the object’s identity. It also should provide a relatively large
distance between views of different objects even if they share common properties, such as
illumination or orientation. If the task is to estimate the orientation, the size, or the colour
of an object, then a distance measure is needed that clusters images of different objects with
the same orientation, size or colour, irrespective of their identity.

The search for an efficient distance measure depends strongly on the choice of an appropriate
representation of the images. The simplest and most straightforward representation of an image
is a representation that will be called pixel-based representation throughout this paper. In
this representation, the image is coded by simply concatenating all the intensity values of a
number of sample points into a large vector. The sample points can correspond to the regular
grid of pixels provided by a digitized image on the computer screen, or they can correspond to
the photoreceptor array in our retina. A 256x256 pixel image thus results in a
65536-dimensional vector located in a vector space of equal dimensionality.

A distance measure in such a simple pixel-based image space must have a fairly complex
structure to provide for object identification or classification. Imagine only the locations of
two views of the same object in such a space that differ only by a slight translation of the
object in the image. Although a human observer could hardly perceive the difference between
the two images, their locations in pixel space would be very distinct from each other.

A space that much better fits the requirements of object (and face) recognition is a space in
which the objects are coded by means of complex high-level features. If the features are chosen
such that they are diagnostic for one attribute (such as identity) but invariant to others (such
as illumination or orientation), it is easy to construct simple metrics that cluster views of
identical objects irrespective of the viewing conditions. Identification as well as
classification with respect to other object attributes can then be carried out by using simple
linear classifiers.

Contrasting the pixel-based representation with a representation based on high-level features
illustrates the trade-off between the complexity of the representation and the complexity of an
appropriate distance measure. Using a simple representation of the image of an object requires
sophisticated distance measures and complex classifiers, whereas with a complex representation,
simpler distance measures and linear classifiers may be sufficient.

The transformation from the pixel space into a feature space involves complex operations, often
including a priori information about the object class the system is dealing with. A crucial
point of feature-based representations is how to define and how to extract relevant features
from the images. The same set of indexed features should be available in all images in order to
be able to compare them. This means that correspondence between the features of different
images has to be established. As a consequence, a high-level feature space tends to be model
specific. Searching for a nose, for instance, only makes sense if the algorithm is confronted
with a face.

A disadvantage of high-level feature spaces may be a loss of information due to the feature
extraction process. A representation coding features such as the size of eyes, nose and mouth,
distances and distance ratios between these features, etc., may serve for identification and
classification tasks, but it might be difficult to reconstruct the original image from this
information.

In the past few years, different researchers have developed feature-based representations of
human faces (Beymer, Shashua, & Poggio, 1993; Beymer & Poggio, 1996; Costen et al., 1996; Craw &
Cameron, 1991; Hancock et al., 1996; Perrett, May, & Yoshikawa, 1994; Vetter, 1996; Vetter &
Troje, 1995). The features used for establishing correspondence span the whole range from
semantically meaningful features, such as the corners of the eyes and mouth, to pixel level
features that are defined by the local grey level structure of the image. Establishing
correspondence between the images has been done by either hand-selecting a limited set of
features or by using adapted optical flow algorithms that define correspondence on the single
pixel level.

In this paper, we will present a particular way of establishing a representation of human faces
that we have developed (Vetter & Troje, 1995). This representation is a feature-based
representation, but nevertheless retains all of the information contained in the original
image. It can thus be used not only for recognition purposes but also for modelling new faces.
Since this representation is based on a pixel-by-pixel correspondence between two images, we
call it a correspondence-based representation.

We will compare this representation with a simple pixel-based representation and evaluate it
against three criteria that we consider important for a representation flexible enough to serve
many of the purposes occurring when processing human faces. These criteria are:
• The set of faces should be convex. If two vectors in the corresponding space are natural
  faces, then any vector on the line connecting these vectors should also correspond to a
  proper face.
• The representation should provide an efficient coding scheme. Redundancies should be
  reduced.
• Important attributes (identity, sex, age, facial expression, orientation) should be easily
  separable.

That convexity is fulfilled for the correspondence-based representation will become directly
evident in the next section, in which we develop our representation step by step starting from
a simple pixel-based representation. In section 3, we address the question of coding efficiency
by evaluating low-dimensional reconstructions based on principal component analysis. In section
4, we use sex classification as an example of a classification task. The generalization
performance of a simple linear classifier using the different representations as input is
investigated.

2 Developing a correspondence-based representation

As mentioned above, it would be desirable for a number of different purposes to develop a
representation of faces that makes it possible to treat them as objects in a linear vector
space. Such a “face space” is the basis for developing metrics that correspond to differences
in identity, gender, age, etc.

An image of a face (as any other image) can be coded in terms of a vector that has as many
coordinates as the image has pixels. Each coordinate codes the intensity of one particular pixel
in the image (Figure 1), so that the vector contains all of the information in the image. The
space spanned by such image vectors, however, has some very unpleasant properties. An important
property of a linear space is the existence of an addition and a scalar multiplication which
define linear combinations of existing objects. All such linear combinations are objects of the
space. In a pixel-based representation, this is typically not the case. One of the simplest
linear combinations - the mean of two faces - will in general not result in a single
intermediate face, but rather in two superimposed images. Any linear combination of a larger set
of faces will appear blurry. The set of faces is not closed under addition.

These disadvantages can be reduced by carefully standardizing the faces in the images, for
instance, by providing for a common position of the eyes. As can be seen from Figure 2, the mean
of two faces in this representation looks better than it did with no alignment. Nevertheless,
there are still plenty of errors in the image. The eyes look good now, but the mean face
contains two mouths and most other parts do not match either. To match the mouths while still
keeping the eyes matched, a scaling operation is needed in addition

[Figures 1 to 4: each figure shows the layout of the coding vector f together with an example
of the mean of two faces, 0.5 x face + 0.5 x face = mean face.]

Figure 1: Pixel-based representation. f = (I_{1,1}, I_{1,2}, ..., I_{n,m}) for an image of
n x m pixels. The image of a face is coded as a vector by concatenating all the pixel values.
The mean of two faces in this representation does not yield a “mean face” but rather a
superposition of two single faces.

Figure 2: f = (I_{1,1}, I_{1,2}, ..., I_{n,m}, -t_x, -t_y). Here, the images are first aligned
with respect to a common position of the symmetry plane and a common height of the eyes. Then
they are coded as in the pixel-based representation. The two factors describing the necessary
translation are also added. The mean of two such representations still contains most features
twice.

Figure 3: f = (I_{1,1}, I_{1,2}, ..., I_{n,m}, -t_x, -t_y, s_x^{-1}, s_y^{-1}). In this
representation, the images are first aligned using two translations and two scaling operations.
The image resulting from this alignment is coded together with the parameters describing the
alignment. Although hardly visible in these small reproductions, the mean face still contains
errors due to misalignment.

Figure 4: Correspondence-based representation. f = (I_{1,1}, I_{1,2}, ..., I_{n,m}, dx_{1,1},
dy_{1,1}, ..., dx_{n,m}, dy_{n,m}), where the second part is the deformation field relative to
the prototype. The images are deformed to match a common prototype. The vector contains the
image after that deformation and the deformation field itself. The mean of two faces in this
representation is one single face consisting of the mean texture on the mean shape.
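
The vector layouts of Figures 1 and 4 can be made concrete with a small NumPy sketch. It is an illustration under strong assumptions, not the implementation used for this report: the two "faces" are synthetic 16x16 images containing one bright patch, the correspondence to the prototype is known by construction rather than computed by optical flow, and `encode_pixel`, `encode_correspondence`, `forward_warp` and `decode_correspondence` are hypothetical helper names. Decoding follows the rule given in section 2 (draw the texture, then apply the deformation field), here with a crude nearest-neighbour warp.

```python
import numpy as np

def encode_pixel(image):
    """Figure 1: concatenate all pixel intensities into one vector."""
    return image.astype(float).ravel()

def encode_correspondence(texture, dx, dy):
    """Figure 4: texture first, then the pair (dx, dy) for every pixel."""
    field = np.stack([dx, dy], axis=-1).astype(float)
    return np.concatenate([texture.astype(float).ravel(), field.ravel()])

def forward_warp(texture, dx, dy):
    """Move each texture pixel (x, y) to (x + dx, y + dy), nearest neighbour.

    Assumption: a deliberately crude warp that can leave holes for
    non-uniform deformation fields; good enough for this toy example.
    """
    h, w = texture.shape
    ys, xs = np.mgrid[0:h, 0:w]
    tx = np.clip(np.rint(xs + dx).astype(int), 0, w - 1)
    ty = np.clip(np.rint(ys + dy).astype(int), 0, h - 1)
    out = np.zeros((h, w))
    out[ty, tx] = texture
    return out

def decode_correspondence(vec, shape):
    """Draw the texture, then apply the deformation field."""
    h, w = shape
    texture = vec[:h * w].reshape(h, w)
    field = vec[h * w:].reshape(h, w, 2)
    return forward_warp(texture, field[..., 0], field[..., 1])

# Synthetic prototype: one bright patch in the middle of the image.
shape = (16, 16)
prototype = np.zeros(shape)
prototype[6:10, 7:9] = 1.0

# Two "faces": the prototype shifted 4 pixels to the left and to the right.
shift_a = np.full(shape, -4.0)
shift_b = np.full(shape, 4.0)
no_shift = np.zeros(shape)
face_a = forward_warp(prototype, shift_a, no_shift)
face_b = forward_warp(prototype, shift_b, no_shift)

# Pixel-based mean: two superimposed patches at half intensity.
pixel_mean = (0.5 * encode_pixel(face_a) + 0.5 * encode_pixel(face_b)).reshape(shape)

# Correspondence-based mean: mean texture on mean shape, a single patch.
vec_a = encode_correspondence(prototype, shift_a, no_shift)
vec_b = encode_correspondence(prototype, shift_b, no_shift)
morph = decode_correspondence(0.5 * vec_a + 0.5 * vec_b, shape)

print("pixel mean: ", np.count_nonzero(pixel_mean), "pixels, max", pixel_mean.max())
print("morph:      ", np.count_nonzero(morph), "pixels, max", morph.max())
```

In this toy case the pixel-based mean contains the patch twice, while the mean of the two correspondence-based vectors decodes to a single patch halfway between the two originals.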

to the translation. This scaling must be done independently in the horizontal and vertical
directions, leading not only to a change in size, but also to a distortion of the face.

Figure 3 shows the representation after this improved alignment. The first part of the vector
encodes the image resulting from the alignment process. The last four coefficients account for
the translation and scaling operations needed for the alignment. Note that we did not enter the
translation and scaling factors themselves but their inverse values. The original image can thus
be reconstructed from the vector representation by drawing the image encoded in the first part
of the vector and then performing translation and scaling operations according to the last part
of the vector. The simple mean of the two sample faces using this latter representation is much
better now. Not only are the eyes aligned, but, at least roughly, the mouths are aligned as
well. However, a more careful look at the images still reveals significant errors. Since the
shapes of the two mouths were very different, a closer inspection shows that there are still two
superimposed mouths rather than one. The noses are not aligned and other features including the
outline of the face are still not matched.

Continuing with this approach leads to the correspondence-based representation described in
more detail by Vetter and Troje (1995). Rather than allowing only simple image operations such
as translation or scaling, any image deformation can be used to align a sample face with a
second face that serves as a common prototype. Figure 4 illustrates the resulting
representation. The first part of the vector again codes the image resulting
after aligning the face to the prototype. The second part of the vector codes the deformation
that has to be applied to this image in order to recover the original image. The deformation is
not encoded in a parameterized form. Rather, it simply describes for each pixel i in the image
the displacement vector (dx_i, dy_i) necessary to match the corresponding pixel in the original
image. The rule that decodes the image from this representation simply reads as follows: Draw
the image given by the first part of the vector and apply the deformation field given by the
second part of the vector. We will refer to the first part of the vector as the texture of the
face and to the second part as the shape of the face.

The correspondence-based representation is completely smooth and convex in the sense that a
linear combination of two or more faces cannot be identified as being synthetic. The mean of two
faces results in a face with the mean texture applied to the mean shape. In computer graphics
this hybrid face is often referred to as the morph between the two faces.

In fact, any inner [1] linear combination of existing textures yields a new valid texture and
any inner linear combination of existing shapes yields a new valid shape. Furthermore, any valid
texture can be combined with any valid shape to yield a new face. The subspaces coding for
texture and for shape can thus be treated independently.

[1] That is, a linear combination in which all coefficients sum up to one.

A critical element of this approach is establishing pixel-by-pixel correspondence between the
sample face and the common prototype. We used a coarse-to-fine gradient-based optical flow
algorithm (Adelson & Bergen, 1986) applied to the Laplacians of the images following an
implementation described in Bergen and Hingorani (1990). The Laplacians of the images were
computed from the Gaussian pyramid adopting the algorithm proposed by Burt and Adelson (1983).
For more details, see Vetter and Troje (1995).

3 The quality of low-dimensional reconstructions

3.1 Images

The images of the faces were generated from a data base of 100 three-dimensional head models
obtained by using a 3D laser scanner. All head models were sampled from persons between 20 and
40 years, without make-up, facial hair or accessories such as earrings or glasses. Half of them
were male, and the other half were female. Head hair was digitally erased from the models. For
details about the acquisition and the preprocessing of the models, see Troje and Bülthoff
(1996).

The images showed the faces from a frontal view. The orientations of the head models in 3D space
were aligned to each other by minimizing the sum-squared distances between corresponding
locations of a set of selected features such as the pupils, the tip of the nose, and the corners
of the mouth. Images were black and white and had a size of 256x256 pixels with a resolution of
8 bits.

3.2 Principal component analysis

Principal component analysis (PCA) is a tool that has been widely used to reduce the
dimensionality of a given data set. PCA is based on the Karhunen-Loeve expansion, a linear
transformation resulting in an orthogonal basis with the axes ordered according to their
contribution to the overall variance of the data set. Truncating the expansion yields
low-dimensional representations of the data with a minimized mean squared error (Ahmed &
Goldstein, 1975).

PCA was first used with images of faces by Sirovich and Kirby (1987) and has been applied
successfully to different tasks, such as face recognition (Turk & Pentland, 1991; O’Toole, Abdi,
Deffenbacher, & Valentin, 1993; Abdi, Valentin, Edelman, & O’Toole, 1995) and gender
classification (O’Toole, Abdi, Deffenbacher, & Barlett, 1991). In all of these investigations,
PCA was applied directly to the pixel-based representation of images, which were only aligned by
means of simple transformations (translation; Sirovich and Kirby also used scaling) that do not
change the character of the face.

Vetter and Troje (1995) applied PCA to the correspondence-based representation of faces. For the
present investigation, we used the same technique, applying PCA separately to the subspaces that
code for the texture and for the shape of the faces, respectively. In addition, we ran PCA on
the images themselves.

3.3 Theoretical evaluation of the reconstructions

PCA yields an orthogonal basis with the axes ordered according to their overall variance. The
principal components equal the eigenvectors of the covariance matrix of the data. The
corresponding eigenvalues are equal to the variances
along each component. The decrease of the variances associated with the principal components
indicates the applicability of PCA for dimensionality reduction.

In Figure 5a, we plotted one minus the relative cumulative variance accounted for by the first k
principal components for the three different PCAs. The relative cumulative variances were
calculated by successively summing up the first k eigenvalues υ_i and dividing them by the sum
of all eigenvalues:

\[ \text{training error}_k = 1 - \frac{\sum_{i=1}^{k} \upsilon_i}{\sum_{i=1}^{n} \upsilon_i} \tag{1} \]

This term is equivalent to the expected value for the mean squared distance between a
reconstruction X_k and the original image X divided by the overall variance σ^2. By X_k we
denote the reconstruction yielded using only the first k principal components.

\[ \text{training error}_k = 1 - \frac{\sum_{i=1}^{k} \upsilon_i}{\sum_{i=1}^{n} \upsilon_i} = \frac{1}{\sigma^2 (n - 1)} \sum_{n} (X_k - X)^2 \tag{2} \]

It is thus an appropriate measure for the reconstruction error. Since it depends on the set of
faces used to construct the principal component space from which the reconstructions were made,
we call this kind of error the training error.

For a training error of 10% (i.e. to recover 90% of the overall variance), the first 47
principal components are needed in the pixel-based representation, 22 principal components are
needed in the texture representation, and 15 are needed in the shape representation. Because the
test face was contained in the set from which the principal components were derived, the
training error approaches zero when using all available principal components for the
reconstruction.

To evaluate how well the representation generalizes to new faces, we performed a leave-one-out
procedure in which one face was taken out of the data base and PCA was performed on the
remaining 99 faces yielding 98 principal components. Then, the single face was projected into
various principal component subspaces ranging from dimensionality k=1 to 98 to yield the
reconstruction X_k. This was done for every face in the data base.

In Figure 5b, the quality of the reconstructions resulting from this procedure is illustrated.
The plot shows the generalization performance of the different representations in terms of the
testing error. Like the training error, the testing error is defined by the mean squared
difference between reconstruction and original image divided by the variance σ^2 of the whole
data set:

\[ \text{testing error}_k = \frac{1}{n \sigma^2} \sum_{n} (X_k - X)^2 \tag{3} \]

The testing error using the pixel-based representation is never smaller than 28%, even if all 98
principal components are used for the reconstruction. A testing error of 28% is reached with
only 5 principal components for the texture space and 5 principal components for the shape
space. If all principal components are used, the testing error can be reduced to 6% for the
shape and to 12% for the texture.

A single image of a face can be used to code either one principal component in the pixel-based
representation or one principal component of the shape subspace and one principal component of
the texture subspace of the correspondence-based representation. Thus the information contained
in five images is enough to code for 72% of the variance in a correspondence-based
representation, whereas 98 images are needed in the pixel-based representation.

The reconstruction errors in Figures 5a and 5b were measured in terms of the squared Euclidian
distance between reconstruction and original in the respective representation. To make the three
distances comparable, we normalized them with respect to the overall variance of the data base
in the respective representation. Texture and shape parts of the correspondence-based
representation were treated separately.

To directly compare the reconstruction qualities achieved with the pixel-based and with the
correspondence-based representation, we combined reconstructed texture and reconstructed shape
to yield a reconstructed image. This was done by applying the deformation field coded in the
reconstructed shape to the image coded in the reconstructed texture. The distance between this
reconstruction and the corresponding original image can be measured by means of the squared
Euclidian distance in the pixel-based image space, and thus in the same space, and with the same
metric as the reconstruction error of the pixel-based representations. Figure 5c shows the
results

[Figure 5: three line plots of the reconstruction error (vertical axis, 0.0 to 1.0) against the
Number of Principal Components (horizontal axis, 0 to 100). Panel (a), Training Error, and panel
(b), Testing Error (A), each show curves for pixel-based, corresp.-based (texture) and
corresp.-based (shape). Panel (c), Testing Error (B), shows curves for pixel-based and
corresp.-based.]

Fig. 5: (a) Training error. In this diagram, one minus the relative cumulative variance has been
plotted. This quantity is equal to the mean of the squared Euclidian distance between the
original face and reconstructions derived by truncating the principal component expansion,
relative to the overall variance. The calculation was performed for the two parts of the
correspondence-based representation and for the pixel-based representation.
(b) Testing error (A). The relative mean squared Euclidian distance between the original and its
reconstructions. In this case, the reconstruction was derived by projecting the data into spaces
spanned by principal components computed from the set of remaining faces which did not contain
the original face. The calculation was performed for the two parts of the correspondence-based
representation and for the pixel-based representation.
(c) Testing error (B). As for the calculation of testing error A the faces were projected into
principal component spaces derived from the remaining faces. The error for the pixel-based
representation is the same as the one plotted in Figure 5b. The error corresponding to the
correspondence-based representation is measured by the squared Euclidian distance in the pixel
space after combining the reconstructed shape with the reconstructed texture to yield an image
(for details, see text).

of this calculation. To achieve a reconstruction error of 28% - the best that can be reached
with 99 faces using a pixel-based representation - only 12 principal components have to be used
in the correspondence-based representation. If all principal components of the
correspondence-based representation are used, a reconstruction error of 13% can be achieved.

3.4 Psychophysical evaluation of the reconstructions

Purpose

The above distance measures are all based on the Euclidian distance in the different face spaces
used. These distances, however, might only approximately reflect the perceptual distance used by
the human face recognition system. Consider, for instance, the fact that human sensitivity to
differences between faces is not at all homogeneous within the whole image. Changes in the
region of the eyes are more likely to be detected than changes of the same size (with respect to
any of our distance measures) in the region of the ears. Since it seems to be very difficult to
formulate an image distance that exactly reflects human discrimination performance, we use human
discrimination performance directly and evaluate the reconstruction quality by means of a
psychophysical experiment.

In the experiment, subjects were simultaneously presented with three images on a computer
screen. In the upper part of the screen, an original face from our data base was shown. Below
this target face, two further images were shown. One of them was again the same original target
face, the other was a reconstruction of it. The subjects indicated which of the two lower

[Figure 6: two plots with the reconstruction quality (REC05, REC15, REC98) on the horizontal
axis and one data series per mode (TEX, SHP, BTH, PIX). Panel (a): Error [%], axis from 0 to 20.
Panel (b): Response Time [s], axis from 0 to 6.]

Fig. 6: Psychophysical evaluation of the different kinds of reconstructions. Error rates (a) and
response times (b) are plotted. TEX: Reconstructed texture combined with original shape. SHP:
Reconstructed shape combined with original texture. BTH: Reconstructed texture combined with
reconstructed shape. PIX: Reconstruction in the pixel-based space. REC05: Reconstructions based
on the first 5 principal components. REC15: Reconstructions based on the first 15 principal
components. REC98: Reconstructions based on all 98 principal components.
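
The conditions named in Figure 6 come from the two-factor mixed design described under Methods: MODE varies between subjects, and the order of the three QUALITY blocks is counterbalanced within each group of six subjects. A hypothetical sketch of such an assignment follows; the function name, the subject numbering and the seeded shuffle are assumptions for illustration, since the report does not say how the assignment was carried out.

```python
import random
from itertools import permutations

MODES = ["TEX", "SHP", "BTH", "PIX"]        # between-subjects factor
QUALITIES = ["REC05", "REC15", "REC98"]     # within-subject factor, one block each

def assign_design(n_subjects=24, seed=0):
    """Map each subject to (mode, block order) so that every mode group
    contains each of the 3! = 6 block orders exactly once."""
    orders = list(permutations(QUALITIES))
    if n_subjects != len(MODES) * len(orders):
        raise ValueError("need exactly one subject per (mode, block order) pair")
    subjects = list(range(n_subjects))
    random.Random(seed).shuffle(subjects)   # random division into groups
    cells = [(mode, order) for mode in MODES for order in orders]
    return dict(zip(subjects, cells))

design = assign_design()
for subject in sorted(design)[:3]:
    mode, order = design[subject]
    print(subject, mode, " -> ".join(order))
```

With 24 subjects and four modes, each group has six members, which is exactly the number of possible orders of three blocks.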

images was identical to the upper one. The time they needed for this task serves as a measure of
the reconstruction quality.

Methods

The reconstructions tested in this experiment were all made by projecting faces into spaces
spanned by the principal components derived from all the other faces in our data base. We thus
used the same “leave-one-out” procedure as described in the context of calculating the testing
error (see previous section). Four different kinds of reconstructions were used. To investigate
the reconstruction quality within the texture subspace we combined reconstructed textures with
the original shape. Similarly, we showed images with reconstructed shape in combination with the
original texture. The third kind of reconstruction was made from a combination of reconstructed
shape and reconstructed texture. Finally, we used reconstructions using the principal components
derived from the pixel-based representation. In each of the four reconstruction modes,
reconstructions using the first 5, 15 and all 98 principal components were shown. We chose these
values because 5 and 15 principal components cover approximately one third and two thirds,
respectively, of the overall variance.

A two-factor mixed block design was used. The first factor was a within-subject factor named
QUALITY that coded for the quality of the reconstruction. It had the levels REC05, REC15 and
REC98, corresponding to reconstructions made by using either only 5, only 15, or all 98
principal components. The second factor was a between-subjects factor named MODE that had the
four levels TEX, SHP, BTH, and PIX. TEX corresponds to trials using images with only the texture
reconstructed, SHP to trials with only the shape reconstructed, BTH to trials with both
reconstructed texture and shape, and PIX to trials using reconstructions in the pixel-based
space.

Twenty-four subjects were randomly divided into four groups, each assigned to one of the levels
of the factor MODE. Each subject performed 3 blocks. Each block contained 100 trials using
either REC05, REC15 or REC98 reconstructions. The order of the blocks was completely
counterbalanced. There are six possible permutations and each of them was used once for one of
the six subjects in each group. Each of the 100 faces was used exactly once in each block.

Each stimulus presentation was preceded by a fixation cross that was presented for 1 sec. Then,
the three images were simultaneously presented on the computer screen. Together they covered a
visual angle of 12 degrees. The subject indicated which of the two bottom images was identical
with the image on the top by pressing either the left or the right arrow key on the keyboard.
Subjects were instructed to respond “as accurately and as quickly as possible”. The images were
presented until the subject pressed one of the response keys. We measured the subjects' error
rate as well as the time they needed to perform the task.

Results

Figure 6 illustrates the results of this experiment. Accuracy was generally very high as
expressed by the low error rates (mean: 5.9%) and the same image quality is strongly reduced.
differences due to the factor MODE did not reach Human observers could discriminate a recon-
significance (two-factor ANOVA on the error rate, struction derived from the pixel-based representa-
F3,20 = 1.49, p > 0.05). We found an increase in tion much faster from the original face than a
the error rate with the number of principal compo- reconstruction derived from the correspondence-
nents used for the reconstruction (main effect of based representation. The results from the psy-
the factor QUALITY: F2,40 = 14.05, p < 0.01) and chophysical experiments are important, since it is
no interaction between the two factors. well known that the Euclidian distance used to
The response times were affected strongly by optimize the reconstructions as well as to com-
both the factor MODE (F3,20 = 10.9, p < 0.01) pute the principal components by itself does not
and the factor QUALITY (F2,40 = 21.8, p < 0.01). in general reflect perceived image distance (Xu &
The interaction between the factors was margin- Hauske, 1994).
ally significant (F6,40 = 2.6, p < 0.05). The mean
response time needed to discriminate between an 4 Sex classification
original image and its reconstruction in the pixel-
based representation (condition PIX) was 606 4.1 Purpose
msec. The mean response times in conditions According to the criteria developed in section
TEX and SHP were 3488 msec and 3385 msec, 1, we would expect an efficient and flexible repre-
respectively. In condition BTH the mean response sentation of faces to cluster together groups of
time was 1872 msec. In all four conditions of the images that share common attributes. Images
factor MODE, response times increased with the showing the same individual should be closer to
number of principal components, although only each other than images of different faces, accord-
very slightly in condition PIX. Note that the time ing to some simple metric. Also, images of faces
needed to identify the worst reconstruction in the of the same age, gender or race should cluster
correspondence-based representation (BTH, according to other metrics.
REC05) from the original was still almost twice In this section, we investigate how well a sim-
the time needed for the best reconstruction in the ple linear classifier can distinguish between male
pixel-based space (PIX, REC98). and female faces, using either the pixel-based or
the correspondence-based representation as an
3.1 Reconstruction quality and coding efficiency input. To examine how robust the respective rep-
The results clearly demonstrate an improve- resentations are against miss-alignment of the
ment in the coding efficiency and generalization faces, we generated different image sets differing
to new face images of the correspondence-based in the degree of their mutual alignment.
image representation over pixel-based techniques
previously proposed (Kirby & Sirovich, 1990; 4.2 Material and Methods
Turk & Pentland, 1991). The correspondence, An extended data base consisting of 200 three-
here computed automatically using an optical dimensional head models was used for these sim-
flow algorithm, allows the separation of two- ulations. Half of them were male and half of them
dimensional shape and texture information in were female. Preprocessing of the models was
images of human faces. The image of a face is performed as described in section 3.1. The initial
represented by its projection coefficients in sepa- alignment was also performed as described previ-
rate linear vector spaces for shape and texture. ously and frontal view images were rendered. In
The improvement was demonstrated computa- addition, we rendered two other sets of images by
tionally as well as in a psychophysical experi- systematically misaligning the heads. For the first
ment. set, we applied small translations adding Gauss-
The results of the different evaluations indicate ian noise with a standard deviation of 0.5 cm (cor-
the utility of the proposed representation as an responding to 5 pixels in the image) to the
efficient coding of face images. We have demon- position of the head in the image plane. For the
strated the coding efficiency within a given set of second set, we misaligned the faces by applying
images as well as the generalizability to new test small rotations in 3D space before rendering the
images not contained in the data set from which images. We added Gaussian noise with a standard
the representations were originally obtained. In deviation of 3 degrees to the orientation of the
comparison to a pixel-based image representation, head around the vertical axis and around the hori-
the number of principal components needed for zontal axis perpendicular to the line of sight.
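The two misalignment conditions amount to drawing one random perturbation per head before rendering. The following is a minimal sketch of that sampling step only, using Python's standard library. The function name `sample_misalignments`, the seed, and the choice to draw the two image-plane axes and the two rotation axes independently are assumptions made for illustration; the report does not specify these details, and the rendering itself is not shown.

```python
import random

def sample_misalignments(n_heads, seed=0):
    """Draw per-head perturbations for the two misaligned image sets.

    Hypothetical helper: independent noise per axis is an assumption.
    """
    rng = random.Random(seed)
    # Set 1: translation in the image plane, std 0.5 cm = 5 pixels per axis.
    shifts_px = [(rng.gauss(0.0, 5.0), rng.gauss(0.0, 5.0))
                 for _ in range(n_heads)]
    # Set 2: rotation about the vertical axis and about the horizontal axis
    # perpendicular to the line of sight, std 3 degrees each.
    rots_deg = [(rng.gauss(0.0, 3.0), rng.gauss(0.0, 3.0))
                for _ in range(n_heads)]
    return shifts_px, rots_deg

shifts_px, rots_deg = sample_misalignments(200)
print(len(shifts_px), len(rots_deg))
```

Each pair in `shifts_px` would be added to the head position of one image, and each pair in `rots_deg` would be applied to the head orientation before rendering.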

Correspondence-based representations were computed from the aligned face set and from the set that had been misaligned by rotations in 3D. The correspondence-based representation of the image set misaligned by small translations was derived directly by adding the constant translation vector to the flow field.
This procedure yielded nine different data sets: each of the three sets of images now existed in terms of the pixel-based representation and in terms of the texture and the shape part of the correspondence-based representation.
Sex classification was performed on each of the nine data sets in the following way:
The 200 faces were randomly divided into two groups (A and B), each containing 50 males and 50 females. Two simulations were run. In the first, group A served as the training set and group B as the test set. In the second simulation the two groups were exchanged, using group B for training and group A for testing. Each simulation began with the calculation of a principal component analysis on the training set. Then both training and test sets were projected on the first 50 principal components, yielding 50 coefficients for each face. We also tested the performance of the classifier when using all 99 principal components, but the results were never better than with 50 components.
The 50 coefficients were used as input for a linear classifier. The classifier itself was formulated as a linear system:

$$a^{\mathrm{train}} = \omega P^{\mathrm{train}} + \omega_0 \qquad (4)$$

$P^{\mathrm{train}}$ is a matrix containing the coefficients of the $i$th face of the training set in the $i$th column. $a^{\mathrm{train}}$ is a row vector containing the desired outputs ($a_i^{\mathrm{train}} = 1$ if $P_i^{\mathrm{train}}$ is male, $a_i^{\mathrm{train}} = -1$ if $P_i^{\mathrm{train}}$ is female). $\omega$ is the row vector containing the weights corresponding to the first 50 principal components and $\omega_0$ accounts for a constant bias. The coefficients $\omega_i$ were optimized by using singular value decomposition in order to minimize the sum-squared error between the desired and the actual output. Note that this is equivalent to training a simple perceptron with linear transfer function.
After training, the test set was projected on the vector $\omega$:

$$\hat{a} = \omega P^{\mathrm{test}} + \omega_0 \qquad (5)$$

The output $\hat{a}$ was compared with the desired output $a^{\mathrm{test}}$ to yield the error rate. An error was recorded if $\mathrm{sgn}(a_i^{\mathrm{test}}) \neq \mathrm{sgn}(\hat{a}_i)$.
In addition, we ran three further simulations to classify the three different image sets by using the coefficients corresponding to the first 25 principal components of the shape subspace together with the first 25 coefficients for the texture subspace as input data.
For all the simulations, the mean error of the two reciprocal simulations (exchanging training and test sets) is reported.

4.3 Results
In Figure 7, the generalization errors resulting from the classification experiments are presented. Using the images showing the faces previously aligned in 3D, classification on the pixel-based representation yielded a relatively low error rate of 4%. Using only the texture subspace of the correspondence-based representation, the error was somewhat higher (5.5%). With only the shape subspace, the error rate was 3%. The best classification (error rate 2%) was obtained when combining the coefficients corresponding to the first 25 principal components of the texture subspace with the coefficients corresponding to the first 25 principal components of the shape subspace.

[Figure 7 plot: Error [%], axis range 0 to 15, for the inputs PIX, TEX, SHP and BTH; legend: aligned, misaligned (translation), misaligned (rotation).]
Figure 7: Generalization errors of the sex classification experiments on three different image sets (see text). As input for the classifier, the coefficients corresponding to the first 50 principal components derived from the pixel-based representation (PIX), the “texture” part (TEX) and the “shape” part (SHP) of the correspondence-based representation were used. Additionally, an input vector consisting of the first 25 principal components of “texture” and “shape” was used (BTH).

Using images of faces that were misaligned, the classification performance for the pixel-based representation dropped significantly. In the first example in which a misalignment was introduced

by applying a small translation to the images, the error rate was 12%. In the other example in which the misalignment was due to small rotations in depth, an error rate of 8% was obtained. The correspondence-based representation was much less affected by misalignment. The error rates for the classification using only the texture stayed constant, and those for the classification using the shape increased only slightly. If a combination of the first principal components of texture and shape was used, the error rates also increased only slightly.

4.4 Discussion
The advantage of the correspondence-based representation is striking when using images of faces that are misaligned. The good classification performance on the aligned images in the pixel-based representation, however, shows that the full pixel-by-pixel correspondence is not needed for sex classification. What is needed is only enough information to perform an alignment of the heads in space. Sex classification is probably a relatively easy task compared with other classification tasks such as the classification of facial expression or the identification of a person. For these latter tasks, the advantage of the correspondence-based representation is expected to be even more pronounced and an optimal rigid alignment in 3D is probably not sufficient.

5 General Discussion

We contrasted the properties of a correspondence-based representation of images of human faces with pixel-based techniques. The motivation behind developing the correspondence-based representation was the lack of convexity of the pixel-based representation. The correspondence-based representation copes with this problem by employing pixel-by-pixel correspondence to perfectly match the images. This results in a representation separating texture and shape information.
We compared low-dimensional reconstructions derived from correspondence-based and pixel-based representations to demonstrate the advantage of the correspondence-based representation for efficient coding and modelling. Finally, we tested the different representations in a simple classification task. We trained a linear network to classify the sex of faces in a training set and tested for generalization performance using a separate testing set of faces.
Clearly, the crucial step in the proposed technique is a dense correspondence field between the images of the faces. The optical flow technique used on our data set worked well; however, for images obtained under less controlled conditions a more sophisticated method for finding the correspondence might be necessary. New correspondence techniques based on active shape models (Cootes et al., 1995; Jones & Poggio, 1995) are more robust against local occlusions and larger distortions when applied to a known object class. Their shape parameters are optimized actively to model the target image. These techniques thus incorporate knowledge specific to the object class directly into the correspondence computation.
The main result of this paper is that an image representation in terms of separated shape and texture is superior to a pixel-based image representation for performing many useful tasks. Our results complement other findings in which a separate texture and shape representation of three-dimensional objects in general was used for visual learning (Beymer & Poggio, 1996), enabling the synthesis of novel views from a single image (Vetter & Poggio, 1996). Finally, based on our psychophysical experiments, we suggest that the correspondence-based representation of faces is much closer to a human description of faces than a pixel-by-pixel comparison of images, which disregards the spatial correspondence of features.

References

Abdi, H., Valentin, D., Edelman, B. and O’Toole, A.J. (1995) “More about the difference between men and women: Evidence from linear neural networks and the principal component approach”, Perception 24:539-562.

Adelson, E.H. and Bergen, J.R. (1986) “The extraction of spatiotemporal energy in human and machine vision”, Proc. IEEE Workshop on Visual Motion, Charleston, pp. 151-156.

Ahmed, N. and Goldstein, M.H. (1975) Orthogonal Transforms for Digital Signal Processing, New York: Springer.

Bergen, J.R. and Hingorani, R. (1990) “Hierarchical motion-based frame rate conversion”, Technical report, David Sarnoff Research Center, Princeton, NJ 08540.

Beymer, D. and Poggio, T. (1996) “Image representation for visual learning”, Science 272:1905-1909.

Beymer, D., Shashua, A. and Poggio, T. (1993) “Example-based image analysis and synthesis”, Artificial Intell. Lab., Massachusetts Inst. Technol., Cambridge, Rep. A.I.M. 1431.

Burt, P.J. and Adelson, E.H. (1983) “The Laplacian pyramid as a compact image code”, IEEE Transactions on Communications 31:532-540.

Cootes, T.F., Taylor, C.J., Cooper, D.H. and Graham, J. (1995) “Active shape models - their training and application”, Computer Vision and Image Understanding 61:38-59.

Costen, N., Craw, I., Robertson, G. and Akamatsu, S. (1996) “Automatic face recognition: What representation?”, in: B. Buxton and R. Cipolla, eds., Computer Vision - ECCV’96, Lecture Notes in Computer Science 1064, Cambridge UK: Springer, pp. 504-513.

Craw, I. and Cameron, P. (1991) “Parameterizing images for recognition and reconstruction”, Proc. British Machine Vision Conference, pp. 367-370.

Hancock, P.J.B., Burton, A.M. and Bruce, V. (1996) “Face processing: Human perception and principal components analysis”, Memory and Cognition 24:26-40.

Jones, M. and Poggio, T. (1995) “Model-based matching of line drawings by linear combination of prototypes”, in: Proceedings of the 5th International Conference on Computer Vision, pp. 531-536.

Kirby, M. and Sirovich, L. (1990) “Application of the Karhunen-Loève procedure for characterization of human faces”, IEEE Transactions on Pattern Analysis and Machine Intelligence 12:103-109.

Perrett, D.I., May, K.A. and Yoshikawa, S. (1994) “Facial shape and judgements of female attractiveness”, Nature 368:239-242.

O’Toole, A.J., Abdi, H., Deffenbacher, K.A. and Valentin, D. (1993) “Low-dimensional representation of faces in higher dimensions of the face space”, Journal of the Optical Society of America A 10:405-411.

O’Toole, A.J., Abdi, H., Deffenbacher, K.A. and Bartlett, J.C. (1991) “Classifying faces by race and sex using an autoassociative memory trained for recognition”, in: K.J. Hammond and D. Gentner, eds., Proceedings of the thirteenth annual conference of the Cognitive Science Society, Hillsdale, NJ: Lawrence Erlbaum Associates, pp. 847-851.

Sirovich, L. and Kirby, M. (1987) “Low-dimensional procedure for the characterization of human faces”, Journal of the Optical Society of America A 4:519-554.

Troje, N. and Bülthoff, H.H. (1995) “Face recognition under varying pose: The role of texture and shape”, Vision Research 36:1761-1771.

Turk, M. and Pentland, A. (1991) “Eigenfaces for recognition”, Journal of Cognitive Neuroscience 3:71-86.

Vetter, T. and Poggio, T. (1996) “Image synthesis from a single example image”, in: B. Buxton and R. Cipolla, eds., Computer Vision - ECCV’96, Lecture Notes in Computer Science 1064, Cambridge UK: Springer, pp. 652-659.

Vetter, T. and Troje, N. (1995) “Separation of texture and two-dimensional shape in images of human faces”, in: S. Posch, F. Kummert, and G. Sagerer, eds., Mustererkennung 1995, New York: Springer, pp. 118-125.

Vetter, T. (1996) “Synthesis of novel views from a single face image”, Max-Planck-Institut für biologische Kybernetik, Tübingen, Germany, Technical Report 26.

Xu, W. and Hauske, G. (1994) “Picture quality evaluation based on error segmentation”, Proc. SPIE, Visual Communications and Image Processing 2308:1-12.
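The sex-classification procedure described in section 4.2 (principal component analysis on the training set, the linear system of equation 4, the sign test on the output of equation 5, and the two reciprocal simulations) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes NumPy. The function names (`pca_basis`, `project`, `train_linear`, `error_rate`, `two_fold_error`) are hypothetical. `numpy.linalg.lstsq`, an SVD-based least-squares solver, stands in for the unspecified SVD routine. The data are synthetic random vectors, not face images.

```python
import numpy as np

def pca_basis(X_train, k):
    """Mean and first k principal axes of the training set (rows = faces)."""
    mean = X_train.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:k]

def project(X, mean, axes):
    """Coefficient matrix P with one face per column, as in eq. (4)."""
    return axes @ (X - mean).T

def train_linear(P, a):
    """Least-squares fit of a = w P + w0 using an SVD-based solver."""
    A = np.vstack([P, np.ones(P.shape[1])]).T  # last column carries the bias
    sol, *_ = np.linalg.lstsq(A, a, rcond=None)
    return sol[:-1], sol[-1]

def error_rate(w, w0, P, a):
    """Fraction of faces for which sgn(a_i) differs from sgn(a_hat_i)."""
    a_hat = w @ P + w0  # eq. (5)
    return float(np.mean(np.sign(a_hat) != np.sign(a)))

def two_fold_error(X_a, a_a, X_b, a_b, k=50):
    """Mean error of the two reciprocal simulations (train A/test B, then swapped)."""
    errs = []
    folds = [((X_a, a_a), (X_b, a_b)), ((X_b, a_b), (X_a, a_a))]
    for (X_tr, a_tr), (X_te, a_te) in folds:
        mean, axes = pca_basis(X_tr, k)
        w, w0 = train_linear(project(X_tr, mean, axes), a_tr)
        errs.append(error_rate(w, w0, project(X_te, mean, axes), a_te))
    return sum(errs) / 2

# Synthetic stand-in data (an assumption for illustration only): two groups
# of 100 vectors with 400 features; the label (+1 or -1) shifts the mean
# along one random direction.
rng = np.random.default_rng(0)
direction = rng.normal(size=400)

def make_group():
    a = np.repeat([1.0, -1.0], 50)
    X = rng.normal(size=(100, 400)) + 0.5 * a[:, None] * direction
    return X, a

(X_a, a_a), (X_b, a_b) = make_group(), make_group()
print(two_fold_error(X_a, a_a, X_b, a_b))
```

For the combined input (BTH), the same functions would be applied after stacking the first 25 texture coefficients and the first 25 shape coefficients of each face into one 50-row matrix, for example with `np.vstack`.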
