Viola Jones Multiview Face
http://www.merl.com
TR2003-96
August 2003
Abstract
This paper extends the face detection framework proposed by Viola and Jones 2001 to handle
profile views and rotated faces. As in the work of Rowley et al. 1998, and Schneiderman et al.
2000, we build different detectors for different views of the face. A decision tree is then trained to
determine the viewpoint class (such as right profile or rotated 60 degrees) for a given window of
the image being examined. This is similar to the approach of Rowley et al. 1998. The appropriate
detector for that viewpoint can then be run instead of running all detectors on all windows. This
technique yields good results and maintains the speed advantage of the Viola-Jones detector.
This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part
without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include
the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories, Inc.; an acknowledgment of
the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or
republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All
rights reserved.
Copyright © Mitsubishi Electric Research Laboratories, Inc., 2003
201 Broadway, Cambridge, Massachusetts 02139
Publication History:
1. First printing, TR2003-96, July 2003
Paul Viola
[email protected]
Microsoft Research
One Microsoft Way
Redmond, WA 98052
1. Introduction
There are a number of techniques that can successfully
detect frontal upright faces in a wide variety of images
[11, 7, 10, 12, 3, 6]. While the definition of frontal and
upright may vary from system to system, the reality is that
many natural images contain rotated or profile faces that
are not reliably detected. There are a small number of systems which explicitly address non-frontal or non-upright face detection [8, 10, 2]. This paper describes progress toward a system which can detect faces reliably and in real time, regardless of pose.
This paper extends the framework proposed by Viola and
Jones [12]. This approach is selected because of its computational efficiency and simplicity.
One observation which is shared among all previous related work is that a multi-view detector must be carefully
constructed by combining a collection of detectors each
trained for a single viewpoint. It appears that a monolithic
approach, where a single classifier is trained to detect all
poses of a face, is unlearnable with existing classifiers. Our
informal experiments lend support to this conclusion, since
a classifier trained on all poses appears to be hopelessly inaccurate.
This paper addresses two types of pose variation: non-frontal faces, which are rotated out of the image plane, and non-upright faces, which are rotated in the image plane.
In both cases the multi-view detector presented in this paper is a combination of Viola-Jones detectors, each detector
trained on face data taken from a single viewpoint.
Reliable non-upright face detection was first presented
in a paper by Rowley, Baluja and Kanade [8]. They train
two neural network classifiers. The first estimates the pose
of a face in the detection window. The second is a conventional face detector. Faces are detected in three steps: for each image window the pose of the face is first estimated; the pose estimate is then used to de-rotate the image window; the window is then classified by the second detector. For non-face windows the pose estimate must be considered random. Nevertheless, a rotated non-face should be rejected
by the conventional detector. One potential flaw of such a
system is that the final detection rate is roughly the product
of the correct classification rates of the two classifiers (since
the errors of the two classifiers are somewhat independent).
One could adopt the Rowley et al. three-step approach while replacing the classifiers with those of Viola and Jones. The final system would be more efficient, but not significantly: classification by the Viola-Jones system is so efficient that de-rotation would dominate the computational expense. In principle de-rotation is not strictly necessary, since it should be possible to construct a detector for rotated faces directly. Detection then becomes a two-stage process: first the pose of the window is estimated, and then one of the rotation-specific detectors is called upon to classify the window.
In this paper detection of non-upright faces is handled using this two-stage approach. In the first stage the pose of each window is estimated using a decision tree constructed from features like those described by Viola and Jones. In the second stage one of the pose-specific Viola-Jones detectors is used to classify the window.
Once pose-specific detectors are trained and available, an alternative detection process can be tested as well. In this case all detectors are evaluated and the union of their detections is reported. We have found that this simple try-all-poses system in fact yields a slightly superior receiver operating characteristic (ROC) curve, but is about 5 times slower. This contradicts a conjecture by Rowley in his thesis, which claims that the false positive rate of the try-all-poses detector would be higher.
Both for the estimation of pose and for the detection of rotated faces, we found that the axis-aligned rectangle features used by Viola and Jones were not entirely sufficient.
In this paper we present an efficient extension of these features which can be used to detect diagonal structures and
edges.
In this paper we have also investigated detectors for non-frontal faces (which include profile faces). Using an identical two-stage approach, an efficient and reliable non-frontal face detector can be constructed. We compare our results with those of Schneiderman et al., whose system is one of the few successful approaches to profile face detection. Li et al. [2] also built a profile detector based on a modification of the Viola-Jones detector, but did not report results on any profile test set.
The main contributions of this paper are:
• A new set of rectangle features that are useful for rotated face detection.
The remainder of this paper i) reviews the face detection framework and our extensions; ii) describes the pose
estimation classifier; iii) describes our experimental results;
and iv) concludes with a discussion.
Figure 1: Example rectangle filters shown relative to the enclosing detection window. The sum of the pixels which lie within the white rectangles is subtracted from the sum of pixels in the gray rectangles.
Following [12], image features are called Rectangle Features and are reminiscent of Haar basis functions (see [4] for the use of Haar basis functions in pedestrian detection). Each rectangle feature $h_j$ is a binary threshold function constructed from a threshold $\theta_j$ and a rectangle filter $f_j$, which is a linear function of the image:

\[ h_j(x) = \begin{cases} \alpha_j & \text{if } f_j(x) > \theta_j \\ \beta_j & \text{otherwise} \end{cases} \]

Here $x$ is a pixel sub-window of an image. Following Schapire and Singer [9], $\alpha_j$ and $\beta_j$ are positive or negative votes of each feature set by AdaBoost during the learning process.
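A minimal sketch of this thresholded voting scheme (the function names below are ours, not the paper's; filter responses and learned parameters are taken as plain numbers):

```python
def feature_vote(f_x, theta, alpha, beta):
    # Binary threshold feature: vote alpha when the rectangle-filter
    # response f_x exceeds the threshold theta, otherwise vote beta.
    # alpha and beta are the votes set by AdaBoost during learning.
    return alpha if f_x > theta else beta

def classifier_score(filter_responses, feature_params):
    # Sum of the individual feature votes for one sub-window.
    return sum(feature_vote(f, theta, alpha, beta)
               for f, (theta, alpha, beta)
               in zip(filter_responses, feature_params))
```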
Previously Viola and Jones used three types of rectangle
filters. The value of a two-rectangle filter is the difference
between the sum of the pixels within two rectangular regions. The regions have the same size and shape and are
horizontally or vertically adjacent (see Figure 1 A and B).
A three-rectangle filter computes the sum within two outside rectangles subtracted from twice the sum in a center
rectangle (see C). Finally a four-rectangle filter computes
the difference between diagonal pairs of rectangles (see D).
Given that the base resolution of the classifier is 24 by 24 pixels, the exhaustive set of rectangle filters is quite large, over 100,000, which is roughly $O(N^4)$ where $N = 24$ (i.e. the number of possible locations times the number of possible sizes). The actual number is smaller since filters must fit within the classification window. Note that unlike the Haar basis, the set of rectangle features is overcomplete.
Computation of rectangle filters can be accelerated using
an intermediate image representation called the integral image [12]. Using this representation any rectangle filter, at
any scale or location, can be evaluated in constant time.
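A minimal NumPy sketch of the integral image trick (the function names are ours): once the integral image is built, any rectangle sum needs at most four array lookups, independent of the rectangle's size.

```python
import numpy as np

def integral_image(img):
    # ii[y, x] = sum of img over all rows <= y and columns <= x.
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, height, width):
    # Sum of the pixels in a rectangle via at most four lookups
    # into the integral image (constant time per rectangle).
    total = ii[top + height - 1, left + width - 1]
    if top > 0:
        total -= ii[top - 1, left + width - 1]
    if left > 0:
        total -= ii[top + height - 1, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total
```

A rectangle filter is then a signed combination of a few such sums, so its cost is also independent of scale and location.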
Figure 2: Example diagonal filters shown relative to the enclosing detection window. The sum of the pixels that lie within the
light gray area is subtracted from the sum of pixels in the dark
gray area. The two basic types of diagonal filters are shown in A
and C. A diagonal filter is constructed from 4 rectangles as illustrated in B and D. The base rectangles can be any aspect ratio and
any size that fits within the detector's query window.
2.3. Classifier
A classifier is a thresholded sum of the feature votes defined above:

\[ C(x) = \begin{cases} 1 & \text{if } \sum_j h_j(x) > 0 \\ 0 & \text{otherwise} \end{cases} \]
2.4. Cascade
In order to greatly improve computational efficiency and
also reduce the false positive rate, Viola and Jones use a
sequence of increasingly complex classifiers called a
cascade. An input window is evaluated on the first classifier
of the cascade and if that classifier returns false then computation on that window ends and the detector returns false.
If the classifier returns true then the window is passed to the
next classifier in the cascade. The next classifier evaluates
the window in the same way. If the window passes through
every classifier with all returning true then the detector returns true for that window. The more a window looks like a
face, the more classifiers are evaluated on it and the longer
it takes to classify that window. Since most windows in an
image do not look like faces, most are quickly discarded as
non-faces. Figure 3 illustrates the cascade.
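The cascade evaluation described above amounts to a short-circuiting loop. In this sketch (names ours) each stage is abstracted as a boolean classifier:

```python
def cascade_classify(window, stages):
    # Evaluate the increasingly complex stage classifiers in order;
    # the first stage that returns False rejects the window outright.
    for stage in stages:
        if not stage(window):
            return False
    # Only windows accepted by every stage are reported as faces.
    return True
```

Because most sub-windows fail an early, cheap stage, the average cost per window stays far below the cost of evaluating the full classifier sequence.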
[Figure 3: Schematic of the detection cascade. Classifiers 1, 2, 3, 4, ... are applied in turn; a rejected sub-window is discarded immediately, while a sub-window accepted by every classifier is passed on for further processing.]

3. Pose estimator
When the pose estimator is evaluated on a non-face window, its output can be considered random. Any detector chosen to evaluate on a non-face window should return false. As mentioned earlier this approach is very similar to that of Rowley et al. [8].
4. Experiments
4.1. Non-upright Faces
The frontal face detector from Viola-Jones handles approximately ±15 degrees of in-plane rotation. Given this, we
trained 12 different detectors for frontal faces in 12 different rotation classes. Each rotation class covers 30 degrees
of in-plane rotation so that together, the 12 detectors cover
the full 360 degrees of possible rotations.
4.1.1 Detectors for each rotation class
The training sets for each rotation class consist of 8356 faces of size 24 by 24 pixels and over 100 million background (non-face) patches. The face examples are the same for each rotation class, modulo the appropriate rotation. In practice, we only needed to train 3 detectors: one for 0 degrees (which covers -15 degrees to 15 degrees of rotation),
one for 30 degrees (which covers 15 degrees to 45 degrees)
and one for 60 degrees (which covers 45 degrees to 75 degrees). Because the filters we use can be rotated 90 degrees,
any detector can also be rotated 90 degrees. So a frontal
face detector trained at 0 degrees of rotation can be rotated
to yield a detector for 90 degrees, 180 degrees and 270 degrees. The same trick can be used for the 30 degree and 60
degree detectors to cover the remaining rotation classes.
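The trick of covering all 12 rotation classes with only 3 trained detectors can be sketched as a small lookup (this mapping is our reconstruction of the scheme just described, with hypothetical names):

```python
def base_detector_and_rotation(rotation_class):
    # rotation_class in 0..11, each covering 30 degrees of in-plane
    # rotation. Classes 0, 1, 2 use the detectors trained at 0, 30 and
    # 60 degrees directly; every other class reuses one of those three
    # detectors with its filters rotated by a multiple of 90 degrees.
    base_angle = (rotation_class % 3) * 30
    filter_rotation = (rotation_class // 3) * 90
    return base_angle, filter_rotation
```

By construction, the base detector's angle plus the filter rotation equals the class's nominal rotation (class k covers roughly 30k ± 15 degrees).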
All of the resulting face detectors coincidentally turned
out to have 35 layers of classifiers. They all took advantage of diagonal features (although for the frontal, upright
detector, the added diagonal features did not improve the
accuracy of the detector over previously trained versions).
Training was stopped after new cascade layers ceased to
significantly improve the false positive rate of the detector
without significantly reducing its detection rate. This happened to be after 35 layers in all three cases.
4.1.2 Pose estimator
For the pose estimator we trained a decision tree with 1024 internal nodes (11 levels) to classify a frontal face into one of the 12 rotation classes. The decision tree was trained using 4000 faces (also of size 24 by 24 pixels) for each of the 12 rotation classes. The set of faces for a particular rotation class used to train the decision tree was a subset of the faces used to train that class's face detector (4000 of the 8356 faces). The limit of 4000 was imposed by memory and speed constraints of the decision tree training algorithm.
The resulting decision tree achieves 84.7% accuracy on
the training set and 76.6% accuracy on a test set (consisting
of the remaining 4356 faces for each rotation class). Because the pose estimator has multiple chances to predict the
rotation class for each face when scanning, the loss in detection rate per face is much less than 23.4% (100% - 76.6%).
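As a back-of-the-envelope illustration of why the per-face loss is much smaller (assuming, unrealistically, that the pose estimates for the several overlapping windows covering one face are independent, which the paper does not claim):

```python
# Illustration only: treats pose estimates for the overlapping windows
# covering a single face as independent trials.
def at_least_one_correct(p, k):
    # Probability that at least one of k pose estimates is correct,
    # given per-window pose accuracy p.
    return 1 - (1 - p) ** k
```

With the measured per-window accuracy p = 0.766, just two chances already push the probability of at least one correct estimate above 0.94.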
Figure 4 also shows the ROC curve for the try-all-rotations detector. The try-all-rotations detector is only slightly more accurate but, as noted above, is nearly 5 times slower.
It is not obvious how to manipulate the detectors to get
different points on the ROC curve since each detector is a
cascade of classifiers each of which has a threshold and we
have multiple such detectors. We use the same strategy as
in Viola-Jones [12]. To decrease the false positive rate (and
decrease the detection rate), the threshold of the final classifier is increased. To increase the detection rate (and increase
the false positive rate), classifier layers are removed from
the end of the cascade. This is done simultaneously for all
of the detectors.
[Figure 4: ROC curves; detection rate from roughly 0.87 to 0.94 over the plotted range of false positives]
Figure 4: ROC curve showing the performance on the non-upright test set for both the two-stage detector (which is much faster) and a detector that runs all 12 cascades on every image window.
References
[1] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory: EuroCOLT '95, pages 23–37. Springer-Verlag, 1995.
[2] S. Li, L. Zhu, Z.Q. Zhang, A. Blake, H.J. Zhang, and
H. Shum. Statistical learning of multi-view face detection.
In Proceedings of the 7th European Conference on Computer
Vision, Copenhagen, Denmark, May 2002.
[3] Edgar Osuna, Robert Freund, and Federico Girosi. Training
support vector machines: an application to face detection.
In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 1997.
[4] C. Papageorgiou, M. Oren, and T. Poggio. A general framework for object detection. In International Conference on
Computer Vision, 1998.
5. Conclusions
[5] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., 1993.
[11] K. Sung and T. Poggio. Example-based learning for view-based human face detection. IEEE Trans. Pattern Anal. Mach. Intell., 20(1):39–51, 1998.
[Figure 7: ROC curves; detection rate from roughly 0.7 to 0.9 over the plotted range of false positives]
Figure 7: ROC curve showing the performance on the profile test set for both the two-stage detector and a detector that runs both left and right profile cascades on every image window.