Face As Mouse Through Visual Face Tracking
… For nose tracking, [8] claimed that defining the nose as an extremum of the 3D curvature of the nose surface makes the nose the most robust feature for tracking with high accuracy. For nostril tracking, the skin-color region is usually extracted first, and the nostrils can be distinguished by their dark color and unique contour shape. By tracking the X-Y facial feature coordinates, the mouse cursor can be navigated. However, we also notice that the movement and location of a facial feature in video usually does not coincide with the user's focus of attention on the screen. This makes the navigation operations unintuitive and inconvenient. To avoid that problem, people have proposed to navigate the mouse cursor by 3D head pose. The estimation of 3D head pose usually requires tracking more than one feature. Head pose can be inferred by stereo triangulation if more than one camera is employed [16][3], or by inference from anthropological characteristics of face geometry [18][10]. Based on the technical developments in this area, some commercial products have been developed in recent years [15][11][7].

For the mouse control module, the conversion from human motion parameters, i.e. position and/or rotation (orientation), to mouse cursor navigation can be categorized into direct mode, joystick mode, and differential mode. In direct mode, a one-to-one mapping from the motion parameter domain to screen coordinates is established by off-line calibration or by design based on a priori knowledge about the human-monitor setting [18]. Joystick mode navigates the mouse cursor by the direction (or the sign) of the motion parameters, and the speed of the cursor motion is determined by the magnitude of the motion parameters [9]. In differential mode, the accumulation of motion parameter displacements drives the navigation of the mouse cursor, and some extra motion parameter switches the accumulation mechanism on and off so that the motion parameter can be shifted backwards without influencing the cursor position. This mode is therefore very similar to a standard mouse, where the user can lift the mouse and move it back to the origin of the mouse pad after performing a dragging operation [9].

After the mouse cursor is navigated to the desired location, the execution of mouse operations, such as mouse button clicks, is carried out according to further interpretation of the changes of the user's motion parameters. The most straightforward interpretation is to threshold some specified motion parameters. In [13], mouse clicks are generated based on "dwell time", e.g. a mouse click is generated if the user keeps the mouse cursor still for 0.5s. In [4], the confirmation and cancelation of mouse operations is conveyed by head nodding and head shaking, and a timed finite state machine is designed to detect the nodding and shaking from the raw motion parameters.

In this paper, we propose to use a 3D model-based visual face tracking approach [17] to retrieve facial motion parameters for mouse control. This approach uses only one camera as video input, but is able to retrieve 3D head motion parameters. It is also not merely a facial feature tracker: the non-rigid facial deformation is formulated as a linear model and can be retrieved as well. Based on the motion parameters retrieved by our system, we designed 3 different mouse control modes. In the experiments, the controllability of the 3 mouse control modes is compared. Finally, we demonstrate how our system can be utilized to play a computer card game in the Windows XP environment.

The rest of the paper is organized as follows: Section 2 summarizes our face model. Section 3 explains how our tracking system works. Section 4 describes the design of our mouse control strategies. Section 5 shows the experimental evaluation of the controllability of our system. Section 6 summarizes the paper and gives some analysis of future directions.

2. 3D face modelling

The 3D geometry of the human facial surface can be represented by a set of vertices {(x_i, y_i, z_i) | i = 1, ..., N} in space, where N is the total number of vertices. In order to model facial articulations, a so-called Piecewise Bezier Volume Deformation Model is developed [17]. With this tool, some pre-specified 3D facial deformations can be manually crafted. These crafted facial deformations are called Action Units (AU) [5]. For our tracking system, 12 action units are crafted, as shown in Fig. 1.

AU   Description
1    Vertical movement of the center of upper lip
2    Vertical movement of the center of lower lip
3    Horizontal movement of left mouth corner
4    Vertical movement of left mouth corner
5    Horizontal movement of right mouth corner
6    Vertical movement of right mouth corner
7    Vertical movement of right eyebrow
8    Vertical movement of left eyebrow
9    Lifting of right cheek
10   Lifting of left cheek
11   Blinking of right eye
12   Blinking of left eye

Table 1. Action Units

If the human face surface is represented by concatenating the vertex coordinates into a long vector V = (x_1, y_1, z_1, x_2, y_2, z_2, ..., x_N, y_N, z_N)^T, the Action Units can be modeled as the vertex displacements of the deformed facial surface from a neutral facial surface, i.e.,
ΔV^(k) = (Δx_1^(k), Δy_1^(k), Δz_1^(k), Δx_2^(k), Δy_2^(k), Δz_2^(k), ..., Δx_N^(k), Δy_N^(k), Δz_N^(k))^T

where k = 1, ..., K, with K being the total number of key facial deformations. Therefore an arbitrary face articulation can be formulated as

V = V̄ + Lp    (1)

where V̄ is the neutral facial surface, L_{3N×K} = {ΔV^(1), ΔV^(2), ..., ΔV^(K)} is the Action Unit matrix, and p_{K×1} is the vector of AU coefficients that defines the articulation. An illustration of such a synthesis is shown in Fig. 2.
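To make the linear articulation model of Eq. (1) concrete, the following is a minimal C sketch of how a deformed face surface could be synthesized from the neutral shape, the AU matrix L, and the coefficient vector p. The data layout and function name are illustrative assumptions, not the paper's actual implementation.

/* Sketch: synthesize an articulated face surface V = Vbar + L*p (Eq. 1).
 * Assumed layout: vertex arrays hold 3N coordinates (x1,y1,z1,...,xN,yN,zN),
 * and L stores the K action-unit displacement vectors, each of length 3N,
 * one after another (column-major by AU). Illustrative names only. */
void synthesize_face(const double *Vbar,   /* neutral surface, length 3N   */
                     const double *L,      /* AU matrix, 3N x K            */
                     const double *p,      /* AU coefficients, length K    */
                     int N, int K,
                     double *V)            /* output surface, length 3N    */
{
    for (int i = 0; i < 3 * N; i++) {
        double v = Vbar[i];
        for (int k = 0; k < K; k++)
            v += L[k * 3 * N + i] * p[k];  /* add k-th AU displacement scaled by p_k */
        V[i] = v;
    }
}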
With rigid head motion, the i-th vertex of the articulated face surface moves to

V_i' = R(θ, φ, ψ)(V̄_i + L_i p) + T    (2)

where R(θ, φ, ψ) is the rotation matrix and T = [Tx Ty Tz]^T defines the head translation. Assuming a pseudo-perspective camera model with projection matrix

M = | fs/z    0     0 |
    |   0    fs/z   0 |

where f denotes the focal length and s denotes a scaling factor, the projection of the face surface onto the image plane can be described as

V_i^Image = M ( R(θ, φ, ψ)(V̄_i + L_i p) + T )    (3)

where V_i^Image is the projection of the i-th vertex node of the face surface. Therefore the face motion model is characterized by the rigid motion parameters, i.e. the rotation angles θ, φ, ψ and the translation T = [Tx Ty Tz]^T, and by the non-rigid facial motion parameter, the AU coefficient vector p.
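For illustration, the following C sketch projects a single face vertex into the image according to Eqs. (1)-(3). The rotation matrix R is taken as an input and the constant "scale" stands in for the pseudo-perspective factor fs/z; the names and layout are hypothetical, a sketch of the model rather than the paper's implementation.

/* Sketch: project the i-th deformed vertex into the image (Eqs. 1-3).
 * R is a 3x3 rotation matrix (row-major), T a 3-vector, and scale stands
 * in for f*s/z. Li holds the 3 rows of L for this vertex, row-major (3 x K). */
void project_vertex(const double R[3][3], const double T[3], double scale,
                    const double vbar[3],     /* neutral position of vertex i */
                    const double *Li,         /* 3 x K block of L for vertex i */
                    const double *p, int K,
                    double out[2])            /* image coordinates X, Y       */
{
    double v[3], w[3];
    /* non-rigid articulation: v = vbar_i + L_i * p  (Eq. 1) */
    for (int a = 0; a < 3; a++) {
        v[a] = vbar[a];
        for (int k = 0; k < K; k++)
            v[a] += Li[a * K + k] * p[k];
    }
    /* rigid motion: w = R * v + T  (Eq. 2) */
    for (int a = 0; a < 3; a++)
        w[a] = R[a][0] * v[0] + R[a][1] * v[1] + R[a][2] * v[2] + T[a];
    /* pseudo-perspective projection: keep x and y, scaled by f*s/z  (Eq. 3) */
    out[0] = scale * w[0];
    out[1] = scale * w[1];
}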
3. Visual face tracking

3.1. Initialization of face tracking

The face modeling equation (3) defines a highly nonlinear system. Fortunately, visual face tracking is not a process of finding a solution from a random initial guess. An initial solution is usually provided with high accuracy by manual labeling or by automatic detection of the face to be tracked in the first frame of the video.

Our system provides an automatic tracking initialization procedure. In the first frame of the video, we do face detection using the Adaboosting algorithm [22]. After face detection, the locations of the facial features are identified by ASM techniques [2][19]. Finally, the generic 3D face model is adapted and deformed to fit the detected 2D facial features. In the worst case, if the automatic procedure goes awry, the user is also provided with a GUI tool to fine-tune the initialization result. An initialization result is shown in Figure 3.

Figure 3. The initialization of our tracking system
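The initialization stage is essentially a three-step pipeline: face detection, 2D landmark localization, and 3D model fitting. A schematic C sketch of that flow is given below; every function in it is a hypothetical placeholder for the corresponding component in the text (the AdaBoost detector [22], the ASM fitter [2][19], and the model adaptation step), not an actual API from the paper.

/* Schematic sketch of the Section 3.1 initialization pipeline.
 * All functions below are hypothetical placeholders, not real APIs. */
#define NUM_LANDMARKS 64   /* assumed number of 2D facial feature points */

typedef struct { int x, y, w, h; } Rect;

int  detect_face_adaboost(const unsigned char *img, int w, int h, Rect *face);
void fit_asm_landmarks(const unsigned char *img, int w, int h,
                       const Rect *face, double landmarks[2 * NUM_LANDMARKS]);
void adapt_3d_face_model(const double landmarks[2 * NUM_LANDMARKS],
                         double pose[6], double *au_coeffs);

int initialize_tracker(const unsigned char *frame, int w, int h,
                       double pose[6]    /* theta, phi, psi, Tx, Ty, Tz */,
                       double *au_coeffs /* K AU coefficients           */)
{
    Rect face;
    /* 1. AdaBoost face detection in the first frame */
    if (!detect_face_adaboost(frame, w, h, &face))
        return 0;                 /* fall back to the manual GUI initialization */
    /* 2. Locate the 2D facial features inside the face box with ASM */
    double landmarks[2 * NUM_LANDMARKS];
    fit_asm_landmarks(frame, w, h, &face, landmarks);
    /* 3. Adapt the generic 3D face model to the 2D features,
     *    yielding the initial pose and AU coefficients.        */
    adapt_3d_face_model(landmarks, pose, au_coeffs);
    return 1;
}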
3.2. Tracking by iteratively solving differential equations

Though the system is nonlinear, at each video frame it can be approximated locally by a first-order Taylor expansion around the solution at the previous frame. Therefore tracking can be formulated as an iterative process of solving linearized differential equations. The displacement of V_i^Image, i = 1, ..., N can be estimated from the video sequence as optical flow ΔV_i^Image = [ΔX_i, ΔY_i]^T, i = 1, ..., N. The displacement of the model parameters, i.e. dW = [dθ, dφ, dψ]^T, dp, and dT = [dTx, dTy, dTz]^T, can be computed
with Least-Mean-Square error once the Jacobian matrix ∂V_i^Image/∂(dW, dp, dT) for each vertex node V_i, i = 1, ..., N is computed. More details of the computation can be found in [17].
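The least-squares update can be pictured as stacking the per-vertex optical-flow constraints into one linear system J·dq ≈ d and solving the normal equations for the parameter displacement dq. The sketch below shows that step for a generic parameter vector; it is a simplified stand-in for the computation detailed in [17], using a naive Gaussian-elimination solver and hypothetical names.

/* Sketch: least-mean-square estimate of the parameter displacement dq from
 * stacked optical-flow constraints J*dq ~= d (J is rows x M, d is rows).
 * Solves the normal equations (J^T J) dq = J^T d with plain Gaussian
 * elimination (no pivoting). Illustrative only; see [17] for the real method. */
#include <math.h>

int solve_parameter_update(const double *J, const double *d,
                           int rows, int M, double *dq)
{
    double A[M][M], b[M];                          /* normal equations (C99 VLAs) */
    for (int i = 0; i < M; i++) {
        b[i] = 0.0;
        for (int r = 0; r < rows; r++) b[i] += J[r * M + i] * d[r];
        for (int j = 0; j < M; j++) {
            A[i][j] = 0.0;
            for (int r = 0; r < rows; r++) A[i][j] += J[r * M + i] * J[r * M + j];
        }
    }
    for (int k = 0; k < M; k++) {                  /* forward elimination */
        if (fabs(A[k][k]) < 1e-12) return 0;       /* singular / ill-conditioned */
        for (int i = k + 1; i < M; i++) {
            double f = A[i][k] / A[k][k];
            for (int j = k; j < M; j++) A[i][j] -= f * A[k][j];
            b[i] -= f * b[k];
        }
    }
    for (int i = M - 1; i >= 0; i--) {             /* back substitution */
        double s = b[i];
        for (int j = i + 1; j < M; j++) s -= A[i][j] * dq[j];
        dq[i] = s / A[i][i];
    }
    return 1;
}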
The whole tracking procedure is illustrated in Fig. 4. At the first frame of the video, the model parameters are initialized. From the second frame on, the optical flow at each vertex of the face surface is computed, the displacement of the model parameters is estimated by solving the system of linear equations, and the model parameters are updated accordingly. This procedure iterates for each frame, and within each frame it also iterates in a coarse-to-fine fashion.
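Putting the pieces together, the per-frame tracking loop described above could be organized as follows. This is only a skeleton under the assumption that compute_optical_flow, compute_jacobian, and solve_parameter_update (the sketch above) stand in for the corresponding components; it is not the paper's actual code.

/* Skeleton of the iterative, coarse-to-fine tracking loop (Section 3.2).
 * The helper functions are hypothetical placeholders for the components
 * named in the text. */
#define MAX_ROWS 2048   /* assumed upper bound on stacked flow constraints (2N) */
#define MAX_PARAMS 18   /* 6 rigid parameters + 12 AU coefficients */

int  compute_optical_flow(int frame, int level, const double *q, double *d);
void compute_jacobian(int frame, int level, const double *q, double *J);
int  solve_parameter_update(const double *J, const double *d,
                            int rows, int M, double *dq);

void track_sequence(int num_frames, int num_levels, int num_params, double *q)
{
    /* q holds the model parameters (head pose + AU coefficients); frame 0 is
     * assumed to be filled in by the initialization procedure of Section 3.1. */
    for (int frame = 1; frame < num_frames; frame++) {
        for (int level = num_levels - 1; level >= 0; level--) {   /* coarse to fine */
            static double J[MAX_ROWS * MAX_PARAMS], d[MAX_ROWS];  /* scratch buffers */
            double dq[MAX_PARAMS];
            int rows = compute_optical_flow(frame, level, q, d);  /* per-vertex flow */
            compute_jacobian(frame, level, q, J);                 /* dV_image / dq   */
            if (solve_parameter_update(J, d, rows, num_params, dq))
                for (int i = 0; i < num_params; i++)
                    q[i] += dq[i];                                /* update parameters */
        }
    }
}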
3.4. Performance of our tracking system

Our tracking system currently runs on a Dell workstation with dual 2GHz CPUs and a SCSI RAID hard drive. The setup of our camera system is shown in Figure 5. The camera is mounted beneath the screen and looks upward at the user's face. Sequences (120 frames, about 10 seconds) of the head pose parameters Tx, Ty, Tz, Rx, Ry, Rz and of the AU coefficients related to the mouth and eyes, as estimated by the tracker, are shown in Figure 6. The figure indicates that the user moved his head horizontally (in the x coordinate) and back and forth (in the z coordinate) and rotated his head about the y axis (yaw movement) in the first 40 frames, opened his mouth wide at about the 80th frame, and blinked his eyes at about the 20th, 40th, and 85th frames.
The normalized correlation between the image patch of f centered at (x, y) and the m × n template t(i, j) is

NCorr(x, y) = [ Σ_i Σ_j (f(x+i, y+j) - μ_f)(t(i, j) - μ_t) ] /
              { ( Σ_i Σ_j (f(x+i, y+j) - μ_f)^2 ) ( Σ_i Σ_j (t(i, j) - μ_t)^2 ) }^(1/2)    (4)

where the sums run over i = -m/2, ..., m/2 and j = -n/2, ..., n/2, and

μ_f(x, y) = ( Σ_i Σ_j f(x+i, y+j) ) / (mn),    μ_t = ( Σ_i Σ_j t(i, j) ) / (mn)

Expanding the products, the correlation can also be computed as

NCorr(x, y) = [ mn Σ_i Σ_j f(x+i, y+j) t(i, j) - μ_f μ_t ] /
              { ( mn Σ_i Σ_j f^2(x+i, y+j) - μ_f^2 ) ( mn Σ_i Σ_j t^2(i, j) - μ_t^2 ) }^(1/2)    (5)

which is faster in terms of multiplication operations than using Eq. (4).
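To illustrate why the rearranged form is cheaper, here is a small C sketch of single-pass normalized correlation over an m × n window. It accumulates the raw sums in one loop and applies the standard identity Σ(f-μ_f)(t-μ_t) = Σft - (Σf)(Σt)/(mn), rather than subtracting the means pixel by pixel as Eq. (4) suggests; this is a generic illustration, not the paper's exact formulation.

/* Sketch: single-pass normalized correlation of an m x n template t against
 * the patch of image f centered at (x, y). Generic illustration of the
 * rearrangement behind Eq. (5); not the paper's exact code. */
#include <math.h>

double ncorr(const unsigned char *f, int f_width,
             const unsigned char *t, int m, int n,
             int x, int y)
{
    double sf = 0, st = 0, sff = 0, stt = 0, sft = 0;
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            double fv = f[(y - n / 2 + j) * f_width + (x - m / 2 + i)];
            double tv = t[j * m + i];
            sf  += fv;        st  += tv;        /* sums            */
            sff += fv * fv;   stt += tv * tv;   /* sums of squares */
            sft += fv * tv;                     /* cross term      */
        }
    }
    double mn  = (double)m * n;
    double num = sft - sf * st / mn;                           /* covariance term    */
    double den = sqrt((sff - sf * sf / mn) * (stt - st * st / mn));
    return den > 0 ? num / den : 0.0;                          /* guard flat patches */
}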
… of the tracker. All these advantages make our tracking system a good candidate for the visual tracking module in a camera mouse.

4. Mouse cursor control

The direct mode, joystick mode, and differential mode are implemented for the mouse control module. For the direct mode, the face orientation angles Rx, Ry (the rotation angles with respect to the x and y coordinates) are mapped to the mouse cursor coordinates (X, Y) on the screen. As the reliable tracking range of Rx and Ry is about 40 degrees, and the resolution of the screen is 1600 × 1200, we empirically let the mapping function be

X = 40(Ry - Ry0)
Y = 30(Rx - Rx0)

where Rx0 and Ry0 are the initial face orientation angles.

For the joystick mouse control mode, the following control rule is employed:

X^{t+1} = X^t + Δ(Ry - Ry0)
Y^{t+1} = Y^t + Δ(Rx - Rx0)

with the rule function Δ(x) defined as

double Delta(double x) {
    double y;
    /* step function: a larger head rotation gives a larger cursor speed;
     * sgn(x) returns the sign of x (+1, -1, or 0) */
    if      (fabs(x) > 15) y = sgn(x) * 64;
    else if (fabs(x) > 10) y = sgn(x) * 16;
    else if (fabs(x) > 5)  y = sgn(x) * 4;
    else if (fabs(x) > 3)  y = sgn(x) * 1;
    else                   y = 0;   /* dead zone: small rotations keep the cursor still */
    return y;
}

The Δ(x) function is a step function in which the constants are specified empirically. We found that it is easier for the user to learn to move the mouse cursor at the desired speed, and to keep the cursor still at the desired location, by changing his/her head pose with such a step control function.
For the differential mouse control mode, we have the following control rules:

X^{t+1} = X^t + α ΔRy^t b^t
Y^{t+1} = Y^t + β ΔRx^t b^t

where

b^t = 0 if Tz^t < Tz^0,  and  b^t = 1 if Tz^t >= Tz^0.

Therefore the mouse is navigated by the accumulation of the head orientation displacements ΔRx^t and ΔRy^t: moving the head toward the camera turns on the mouse dragging state, and moving the head away from the camera turns on the mouse lifting state.
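A small C sketch of this differential rule is given below. The gains alpha and beta and the reference depth Tz0 correspond to α, β, and Tz^0 in the rules above; the concrete values and the surrounding event loop are assumptions made only for illustration.

/* Sketch of the differential mouse control rule. The cursor accumulates
 * head-orientation displacements only while b^t = 1 (dragging state);
 * b^t = 0 freezes the cursor, like lifting a physical mouse.
 * Names and gains are illustrative assumptions. */
typedef struct { double X, Y; } Cursor;

void differential_update(Cursor *c,
                         double dRx, double dRy,   /* frame-to-frame rotation change */
                         double Tz, double Tz0,    /* current and reference depth    */
                         double alpha, double beta)
{
    int b = (Tz >= Tz0) ? 1 : 0;   /* b^t from the rule above: 1 = dragging, 0 = lifted */
    c->X += alpha * dRy * b;
    c->Y += beta  * dRx * b;
}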
Variations in the non-rigid motion parameters trigger the mouse button events. While there are 12 AUs to select from, not all of them are good for triggering mouse events. Ideally the detection of an AU should be robust against head pose changes and noisy outliers. AU7 and AU8 (eyebrow raising) are not good because eyebrow movements are relatively subtle to detect. AU9 and AU10 (cheek lifting) are not good because the lack of texture on the cheek makes the estimation unreliable. And AU11 and AU12 (eye blinking) are not good because the user may blink his eyes unintentionally.
… of 1600 by 1200. The clicking results are illustrated in Fig. 7. From the figure, we can see that the average localization error is about 10 pixels for the direct mode, about 3 pixels for the joystick mode, and about 5 pixels for the differential mode. The localization error is mostly caused by measurement noise introduced during the tracking process, and partly by the fact that a person cannot hold his head perfectly still.

We then evaluated the reliability of the detection of mouth opening and stretching. When the user assumes a near-frontal view (within ±30 degrees) and the tracker is in a benign tracking state, the detection of mouth opening and stretching achieves a 100% detection rate with zero false alarms.
Figure 8. Using the camera mouse to navigate in Windows XP. The user is opening his mouth to activate the Start menu.

Figure 9. Using the camera mouse to play the Windows card game Solitaire. The user is dragging the Spade-7 from the right toward the Heart-8 on the left by turning his head with his mouth opened.
[19] J. Tu, Z. Zhang, Z. Zeng, and T. Huang. Face localization via
hierarchical condensation with fisher boosting feature selec-
tion. In IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR’04), volume 2, pages
719–724, 2004.
[20] M. Turk, C. Hu, R. Feris, F. Lashkari, and A. Beall. TLA based face tracking. In International Conference on Vision Interface, pages 229–235, Calgary, 2002.
[21] M. Turk and G. Robertson. Perceptual user interfaces. Communications of the ACM, volume 43, pages 32–34, 2000.
[22] P. Viola and M. Jones. Fast and robust classification using
asymmetric adaboost and a detector cascade. In Advances
in Neural Information Processing System, volume 14. MIT
Press, Cambridge, MA, 2002.