Image and Vision Computing 25 (2007) 321–330
Human gait recognition at sagittal plane
Rong Zhang a,*, Christian Vogler b, Dimitris Metaxas a
a Department of Computer Science, Rutgers University, 110 Frelinghuysen Road, Piscataway, NJ 08854, USA
b Gallaudet Research Institute, Gallaudet University, HMB S-433, 800 Florida Avenue NE, Washington, DC 20002, USA
* Corresponding author. E-mail address: [email protected] (R. Zhang).
Received 16 October 2004; received in revised form 24 August 2005; accepted 11 October 2005
Abstract
The reliable extraction of characteristic gait features from image sequences and their recognition are two important issues in gait recognition. In
this paper, we propose a novel two-step, model-based approach to gait recognition by employing a five-link biped locomotion human model. We
first extract the gait features from image sequences using the Metropolis–Hasting method. Hidden Markov Models are then trained based on the
frequencies of these feature trajectories, from which recognition is performed. As it is entirely based on human gait, our approach is robust to
different types of clothes the subjects wear. The model-based gait feature extraction step is insensitive to noise and to cluttered or even moving backgrounds. Furthermore, this approach also minimizes the size of the data required for recognition compared to model-free algorithms. We applied our method to both the USF Gait Challenge data set and the CMU MoBo data set, and achieved recognition rates of 61 and 96%, respectively. We further studied the relationship between the number of subjects within a data set and the recognition rate. The results suggest that the recognition rate is significantly limited by the distance of the subject from the camera.
© 2006 Published by Elsevier B.V.
Keywords: Gait recognition; Biometrics; Human motion analysis; Human identification; Hidden Markov model
1. Introduction
Humans can be identified through many biometrics. Face, iris, and fingerprints have been successfully employed in automatic human identification systems [4,7,18]. However, as they require the subject to be very close to the camera for accurate identification, those characteristic features are not suited for surveillance applications at low resolutions. Gait, or the manner in which people walk, is the only feature available for recognition when the subject is far from the camera, and it has attracted increased attention recently [5,21,24,25]. Moreover, behavioral features such as gait are more difficult to disguise than facial features. Any attempt to adopt a different gait only makes the subject appear more suspicious.
People can identify walking acquaintances even when they are too far away to be recognized by their faces. It is commonly agreed that the human visual system is extremely sensitive to motion stimuli, although the exact mechanism is still unclear
[33]. In 1973, Johansson [15] introduced a new visual stimulus
called point-light display, in which human action is reduced to
a few moving points of light. It was observed that biological
motion could be accurately perceived with such a simple visual
input. More recent studies performed by Stevenage et al. [28]
demonstrated that human movements in video could be used as
a reliable cue to identify individuals. These findings inspired
researchers in computer vision to extract potential characteristic gait signatures from image sequences for human
identification.
Specifying idiosyncratic gait features in image sequences containing moving subjects is challenging due to the similarities in the spatiotemporal walking patterns of different people. The differences in the kinematics and dynamics of human body motion need to be detected for identification purposes. In addition, these features should be invariant to other factors such as clothing or hair style. Early research performed by Murray [23] showed that gait is a unique behavioral characteristic if all gait movements are considered. However, in general gait recognition settings, the input signals are image sequences taken by cameras. The loss of depth information in those two-dimensional image sequences makes it impossible to fully recover the three-dimensional gait patterns. Therefore, gait recognition in the computer vision field is commonly performed on two-dimensional features extracted from two-dimensional image sequences.
Observing that most walking dynamics take place in the sagittal plane, we aim to identify walking subjects through
video sequences containing their side view. In this paper, we propose a two-step, model-based approach to gait recognition employing exclusively biological motion information. The proposed algorithm is outlined in Fig. 1. First, we fit each image using a five-link biped human locomotion model to extract the joint position trajectories. These features are invariant to the subjects' clothes, hair style, and body shape information, except for the height of each body part. The recognition step is then performed using Hidden Markov Models (HMMs) based on the frequency components of these joint trajectories. Applying our approach to both the CMU MoBo and the USF Gait Challenge data sets, we demonstrate that promising recognition rates can be obtained using only gait features.

Fig. 1. Outline of the proposed algorithm.
This paper is organized as follows. Section 2 summarizes
the existing approaches to the gait recognition problem. The
five-link biped human model is described in Section 3. Section
4 provides details of the extraction of gait features, whereas recognition using HMMs is described in Section 5. Experimental results are presented in Section 6, followed by the conclusion in Section 7.
2. Previous approaches to gait recognition
Current approaches for the gait recognition problem all
contain two major components: extraction of motion features
from image sequences, and the subsequent identification by
comparing the similarity between the probe feature sequence
and those in the gallery.
Existing methods for feature extraction can be divided into
two categories: model-based and model-free approaches. An
explicit structure is employed to interpret the human body
movements in image sequences in model-based approaches,
whereas features are extracted without considering human
structure in model-free methods.
Model-free methods focus on the spatiotemporal information contained in silhouette images, which are binary images indicating whether or not each pixel belongs to the subject. This representation has the advantage of eliminating the texture and color information of the subject. Two baseline approaches were proposed for gait recognition based on silhouette images: Phillips et al. [25] compared the entire silhouette image sequences to the gallery, while Collins et al. [5] selected key frames for comparison. Murase et al. [11] extracted an eigenspace representation from silhouette image sequences, while Huang et al. [14] extended the method using canonical space transformation. Other low-level image features have been extracted to identify the spatial and temporal variances of human gait, including the width of the outer contour of the silhouette [16], gait mask responses [9], moments of the optical flow [22], linear decomposition of style and content [19], the generalized symmetry operator [12], and eigengait [2]. Lee et al. [21] fit seven ellipses to the human body area, and used their locations, orientations, and aspect ratios as features to represent the gait. All features used in these model-free approaches are susceptible to noise and background clutter, since they are calculated either at the pixel level (background subtraction) or within small regions (edge map and optical flow calculation). Recently, silhouette refinement [20] has been proposed to improve the recognition rate. However, the features extracted using the above methods include shape information, which should be avoided for gait recognition.
With gait features closely related to the walking mechanics, model-based approaches have the potential for robust feature extraction. Human-like structures have been proposed for gait feature extraction. A two-dimensional stick model was obtained through line fitting to the skeleton of the silhouette images by Niyogi and Adelson [24]. Cunado et al. [6] modelled the thighs as interlinked pendula to extract their angular movements. Yam et al. [34] used the pendulum model to explore the difference between walking and running motions. These recognition features are extracted over large regions, which makes them less sensitive to image noise. More significantly, these features do not contain shape information. Though compact in representation, the above features were used to achieve satisfactory recognition rates on small gait data sets, demonstrating the potential of identifying persons using only movement features.
Following gait feature extraction, an identification step is performed based on a similarity measurement between the probe and training sequences. Many classification methods have been applied for this purpose, such as canonical analysis [9], covariance measurement [25], and support vector machines [19]. Nearest neighbor or K-nearest neighbor methods have been applied intensively [2,6,12,16,21,24,31]: gallery examples are sorted according to their similarity to the probe sequence, and the probe sequence is labeled with the class to which most of the top k examples belong. Another commonly used method is the Hidden Markov model (HMM) [17,30], a useful tool for representing temporally dependent processes. Human gait is a periodic process over time; therefore, HMMs are appropriate
for coping with the stochastic properties of the dynamics.
Exemplars of width vectors [17] or postures [30] were used as
the states of the HMMs to characterize the gait. Sequences of
the same subject in the data set are used to train an HMM, and
identification is executed by choosing the label of the HMM
that generates the given probe sequence with the highest
probability.
Here, we propose a five-link biped model to extract the motion trajectories of the joint positions in image sequences. Unlike the method used in [35], where the joint angles were extracted via a line fitting process, our fitting algorithm is applied to entire body parts, leading to more robust and reliable results. HMMs are used in the recognition step.
3. Five-link biped model
A good human model for gait recognition should be simple;
however, it should also be general enough to capture the
walking dynamics of most people, and to be customized for
different persons. Complicated human models, such as the 3D
deformable model [27], are not practical for efficient human
tracking.
Studies carried out by physiologists show [3] that most
walking dynamics take place in the sagittal plane, or the plane
bisecting the human body (Fig. 2(a)), and that the trajectories
of the legs in the sagittal plane reveal most of the walking
dynamics. Therefore, we are interested in extracting the motion
dynamics information contained in the image sequences of
subjects walking parallel to the camera (side view). Fig. 2(b)
shows a typical side view of a walking person. We choose a
two-dimensional five-link biped locomotion model, shown in
Fig. 2(c), to effectively represent the physical structure and
movement constraints of the human body.
The lower limbs are represented as trapezoids, whereas the
upper body is simplified as the upper half of the human
silhouette without arms. Each body part is considered to be
rigid, with movement allowed only at the joint positions. The influence of arm dynamics has been neglected in our dynamic model. This treatment is justified, as little information about the arms is available in visual images of people walking at a distance, which makes it difficult to recover the exact arm positions. These simplifications are necessary for a compact model to reduce the computational complexity, while at the same time enabling the capture of most of the dynamics of the walking subject.
If the length and width of each part (the shape model) are fixed, the biped model M has seven degrees of freedom, M = (C = {x, y}, Θ = {θ_1, θ_2, θ_3, θ_4, θ_5}), where C is the position of the body center within the image, and Θ is the orientation vector consisting of the sagittal plane elevation angles (SEAs) of the five body parts. The SEA of a body part is defined as the angle between the main axis of the body part and the y-axis [29], as shown in Fig. 2(b).
in the images may differ for different persons, or the same
person at different distances from the camera. One way to
obtain the shape model for each person is to manually locate
the joint positions and body parts on the first image of each sequence, which is tedious when the data set is large. In our
work, we develop a model fitting method for the initialization
process, in which the size of each body part in the image is
specified by fitting the human shape model to the silhouette
image.
3.1. Scale-invariant body model
For our purpose, we need a general human shape model
independent of scaling. As shown in Fig. 2, the human body
model, without considering neck and head, consists of five
trapezoids, connected at the joints. Each trapezoid is defined by
its height (l) and the lengths of the top and bottom bases (t and
b, respectively). Hence, each body part p_i, i = 1, ..., 5, can be represented as p_i = {a_i, b_i, l_i}, where a = t/l and b = b/l are the base-to-height ratios. By normalizing the body part heights with respect to the height of the trunk (l_5), we obtain a shape model invariant to scaling, which is parameterized by two vectors: the base-to-height ratio vector K = {a_1, b_1, a_2, b_2, ..., a_5, b_5}, and the relative height vector R = {r_1, r_2, ..., r_5}, where r_i = l_i / l_5. Together with the biped model M, we can describe the human body posture as H = {K, R, M}.

Fig. 2. (a) Illustration of the sagittal plane. (b) Side view of a walking subject. (c) Five-link biped human model, where θ_i (i = 1, ..., 5) is the sagittal elevation angle of the corresponding body part. (d) Schematic representation of an individual body part.
We assume that the model parameters are independent of each other and subject to Gaussian distributions. The orientation vector is subject to a uniform distribution over an interval L_Θ given by the physical limits of the joints. Thus, the probability distribution of the human model can be expressed as

H = {K, R, M} ∼ G(K; Σ_K) G(R; Σ_R) U(C) U(Θ; L_Θ)   (1)

The means and variances are estimated from the measurements provided in [32].
3.2. Initialization of body shape model
The orientation and the actual size of each body part in the image are specified in the initialization step. To separate the subject from the background, we perform a background subtraction procedure to obtain the silhouette image, as described in [8]. We assume that only one subject is present within each image, so that the largest blob within the silhouette image can be regarded as the subject's location. We choose to fit the five-link biped model to the silhouette image when the two legs are furthest apart from each other, i.e. the double stance phase. At this phase, the SEAs of the shank and thigh of the same leg are approximately identical, and the overlap between the legs is small, making it less ambiguous to parameterize the body parts. Note that we do not distinguish left and right legs here.
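A minimal sketch of this silhouette step, assuming OpenCV and substituting a simple frame difference for the non-parametric background model of [8]; the function name and threshold are hypothetical.

import cv2
import numpy as np

def silhouette(frame, background, thresh=30):
    """Binary silhouette of the single walking subject in `frame`."""
    diff = cv2.absdiff(frame, background)
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    # Keep the largest blob: label connected components, pick the biggest area.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    if n <= 1:
        return np.zeros_like(mask)      # no foreground found
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    return np.where(labels == largest, 255, 0).astype(np.uint8)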
We first obtain a rough estimate of the human model H from the silhouette image S. The body center position in the image is set to the median of the silhouette pixel coordinates:

C = (x, y) = (median_{x_i∈S}(x_i), median_{y_i∈S}(y_i)).   (2)
To calculate the orientation vector Θ, we select three sub-regions within the silhouette image: one for the upper body and one for each leg, as shown in Fig. 3(a). The SEA of each body part is set to the angle of the main axis of the pixels within the corresponding region, i.e. the axis with the least second moment [13]. Given Θ, the height of each body part can be obtained based on the height of the silhouette image.
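A possible implementation of this rough orientation estimate from the second moments of a binary sub-region [13]; the final conversion to an angle measured from the vertical is an assumed convention.

import numpy as np

def main_axis_angle(region):
    """SEA estimate for a binary sub-region: orientation of the axis of
    least second moment of its foreground pixels (see Horn [13])."""
    ys, xs = np.nonzero(region)
    x = xs - xs.mean()
    y = ys - ys.mean()
    a = np.sum(x * x)                    # second central moments of
    b = 2.0 * np.sum(x * y)              # the pixel distribution
    c = np.sum(y * y)
    theta = 0.5 * np.arctan2(b, a - c)   # principal axis, from the x-axis
    return np.pi / 2 - theta             # assumed SEA convention: from vertical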
The above estimate provides a good starting point, as shown in Fig. 3(b); however, it may not be very accurate in some cases. For example, in Fig. 3(b), the locations of the head and the right (in the image) shank of the model deviate observably from the actual ones. For further refinement, we seek a human model H* which best fits the silhouette image S. Using Bayesian inference, we formulate this procedure as

H* = arg max_H p(H|S) = arg max_H p(S|H) p(H)   (3)

where the prior distribution p(H) is given in Eq. (1), and the likelihood p(S|H) specifies the silhouette generating process from human model H to S.
Assume S′ is the shape generated by the human model H, C_S is the boundary point set of shape S, and A(S) is the corresponding area. The similarity between two shapes can be determined by their overlap area and the proximity of their boundaries. The likelihood function, therefore, is defined as

p(S|H) = p(S|S′) ∝ (∏_{v∈C_{S′}} G(D(v, C_S); σ_d²))^{w_1} (G(A(S) − A(S′); σ_S²))^{w_2}.   (4)

Here, σ_d² and σ_S² are the variances of the distance and area, respectively, w_1 and w_2 are the weights of the two components, and

D(v, C_S) = min_{v′∈C_S} d(v, v′)   (5)

is the minimum distance from point v to the contour of S, calculated by the distance transform.
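The likelihood of Eqs. (4) and (5) can be sketched with a distance transform as below; the weights and variances are placeholders, and the boundary extraction is a simplified stand-in.

import numpy as np
from scipy.ndimage import distance_transform_edt

def log_likelihood(S_obs, S_model, sigma_d=2.0, sigma_A=500.0, w1=1.0, w2=1.0):
    """log p(S|H) up to additive constants; S_obs, S_model are boolean masks."""
    # Boundary pixels of the observed silhouette (simple neighbor-change test).
    boundary = ((S_obs ^ np.roll(S_obs, 1, axis=0)) |
                (S_obs ^ np.roll(S_obs, 1, axis=1)))
    # Distance from every pixel to the nearest observed boundary pixel.
    dist = distance_transform_edt(~boundary)
    # Model boundary points v in C_{S'} and their distances D(v, C_S), Eq. (5).
    model_boundary = ((S_model ^ np.roll(S_model, 1, axis=0)) |
                      (S_model ^ np.roll(S_model, 1, axis=1)))
    d = dist[model_boundary]
    log_b = -0.5 * np.sum(d ** 2) / sigma_d ** 2                    # boundary term
    log_a = -0.5 * float(S_obs.sum() - S_model.sum()) ** 2 / sigma_A ** 2
    return w1 * log_b + w2 * log_a                                  # Eq. (4), log form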
Finding the globally optimal H* is rather difficult due to the high dimensionality of H. Here, we use the Metropolis–Hastings method, which guarantees convergence to samples from the posterior.

Fig. 3. (a) A silhouette image, where the three blocks correspond to the regions where the shape model calculation is performed. (b) Rough estimation result. (c) Initialization result.
Starting from the rough estimate H obtained above, the Metropolis–Hastings steps for adjusting H are:

1. Generate a new sample H′ according to q(H → H′):

q(H → H′) ∝ p(H′) G(H′ − H; Σ_H),   (6)

where Σ_H is the covariance matrix of the model parameters.

2. Accept H′ with probability

a = min(1, [p(H′|S) q(H′ → H)] / [p(H|S) q(H → H′)]).

3. Repeat steps 1 and 2 until p(H|S) is high enough or a maximum number of iterations is reached.
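A minimal sketch of this refinement loop, with the proposal simplified to a symmetric Gaussian random walk so that the Hastings correction cancels; the paper's actual proposal of Eq. (6) is weighted by the prior. Here H is treated as a flat parameter vector, and log_prior, log_likelihood, and render are assumed helpers in the spirit of the sketches above.

import numpy as np

def refine(H, S_obs, render, log_prior, log_likelihood,
           step=0.02, iters=2000, seed=0):
    """Sample p(H|S) by Metropolis-Hastings with a random-walk proposal."""
    rng = np.random.default_rng(seed)

    def log_post(h):
        lp = log_prior(h)
        return lp if lp == -np.inf else lp + log_likelihood(S_obs, render(h))

    lp = log_post(H)
    for _ in range(iters):
        H_new = H + step * rng.standard_normal(H.shape)   # propose H'
        lp_new = log_post(H_new)
        # Accept with probability min(1, p(H'|S) / p(H|S)).
        if np.log(rng.uniform()) < lp_new - lp:
            H, lp = H_new, lp_new
    return H, lp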
The sizes of the neck and head are then calculated from their relative sizes with respect to the trunk length [32]. Fig. 3(c) shows the initialization result for the silhouette image in Fig. 3(a), which is an improvement over the rough estimate in Fig. 3(b). We have noticed that the initialization step relies on the quality of the silhouette image. Hence, further refinement of the parameters may be needed if the silhouette image is severely corrupted.

After the initialization step, the location and size of each body part within the image are specified; therefore, we can obtain the appearance model (W) of each body part based on the color information within the corresponding image region.
Fig. 4. The Geman–McClure function.

4. Tracking

Since we have extracted the shape model and the initial configuration, the next step is to extract gait signatures over time based on this shape model. This is a two-dimensional tracking problem, i.e. locating the image position and orientation of each body part over the image sequence. Current 2D-based tracking methods use either image edges or dense optical flow for detection and tracking. However, image cues such as optical flow and edges are not entirely reliable, especially when calculated from noisy images. To achieve robustness, we need to carry out our computations within large regions, e.g. at the body part level. The image information we utilize is the color and the inner silhouette region.

For an input frame I_t at time instance t, we use the background model to obtain the silhouette image S_t. Given the appearance model (W) and the human model parameters M_t = (C_t, Θ_t), we can compose an image I(M_t; W). The best human model configuration should make this image as close to I_t as possible. In addition, the area of the human model should be equal to the area of the silhouette image, and the difference between the biped model configurations at time instances t−1 and t should be small. Therefore, we estimate the best biped model M_t by minimizing the total energy

E = w_c Σ ρ(I_t − I(M_t; W); σ) + w_A (A(S_t) − A(M_t))² + w_m |M_t − M_{t−1}|.   (7)

Here, w_c, w_A, and w_m are three weight factors, and ρ is the Geman–McClure function [1], shown in Fig. 4 and defined as

ρ(x; σ) = x² / (σ² + x²),   (8)

which constrains the effect of large residual values (x), as it saturates at one for large values of x. The robust scale parameter σ is defined as

σ = 1.4826 × median|I_t − I(M_t; W)|.   (9)

The minimization of the energy term in Eq. (7) is equivalent to maximizing the probability

p(M_t | I_t) ∝ exp(−E),   (10)

by employing the same Metropolis–Hastings method used in the initialization step.
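A sketch of the energy of Eqs. (7)-(9), assuming a render helper that composes I(M_t; W) together with its support mask; the weight values are illustrative.

import numpy as np

def geman_mcclure(x, sigma):
    """rho(x; sigma) = x^2 / (sigma^2 + x^2), saturating for large residuals."""
    return x ** 2 / (sigma ** 2 + x ** 2)

def energy(M_t, M_prev, I_t, S_t, render, w_c=1.0, w_A=1e-4, w_m=0.1):
    """Total tracking energy E of Eq. (7) for configuration M_t."""
    I_model, mask = render(M_t)                  # composed image and its support
    resid = (I_t - I_model)[mask]
    sigma = 1.4826 * np.median(np.abs(resid))    # robust scale, Eq. (9)
    E_color = np.sum(geman_mcclure(resid, sigma))
    E_area = float(S_t.sum() - mask.sum()) ** 2  # area mismatch A(S_t) - A(M_t)
    E_smooth = np.sum(np.abs(M_t - M_prev))      # configuration change penalty
    return w_c * E_color + w_A * E_area + w_m * E_smooth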
The initial C_t is calculated as the mass center of the silhouette image, as given in Eq. (2). The predicted orientations are given by:

Θ_t = 2Θ_{t−1} − Θ_{t−2}   (11)
Fig. 5. The space domain features: knee stride width, knee elevation, ankle elevation, and ankle stride width.
Fig. 6. The trajectories of the space domain features for two different subjects, plotted over the normalized gait cycle.
5. Recognition
Based on the tracking results obtained with the biped model, the differences among people are largely temporal. It is, therefore, necessary to choose a feature representation that makes the temporal characteristics of the data explicit. The sagittal elevation angles extracted from the above tracking procedure capture the temporal dynamics of the subject's gait, whereas the trajectories of the corresponding joint positions reveal the spatiotemporal history. In addition, studies have shown that the SEAs exhibit less inter-subject variation across humans [3,29]. Therefore, our recognition method focuses on the joint position trajectories, as described in detail in this section.
5.1. Recognition features
To this end, we first compute the following space domain features: ankle elevation (s_1), knee elevation (s_2), ankle stride width (s_3), and knee stride width (s_4), as illustrated in Fig. 5. The trajectories of these four features for two different subjects are shown in Fig. 6. In this plot, all four trajectories are truncated into several pieces, each containing the motion dynamics within one gait cycle, and the length of each gait cycle is normalized to the interval from 0 to 1. From Fig. 6, we see that the differences between the two subjects are subtle. To distinguish two gait sequences, a frequency domain representation seems particularly suitable due to the cyclic nature of gait.
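This cycle normalization can be sketched by resampling each cycle onto a fixed grid; the sample count is an illustrative choice.

import numpy as np

def normalize_cycle(feature, start, end, samples=32):
    """Resample feature[start:end] (one gait cycle) onto `samples` points in [0, 1]."""
    cycle = np.asarray(feature[start:end], dtype=float)
    t_old = np.linspace(0.0, 1.0, num=len(cycle))
    t_new = np.linspace(0.0, 1.0, num=samples)
    return np.interp(t_new, t_old, cycle)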
For each of these four features s_i, we compute the Discrete Fourier Transform (DFT), denoted S_i, over a fixed window of 32 frames, which we slide over the feature signal sequences:
S_i(n) = (1/32) Σ_{k=0}^{31} s_i(k) e^{−2πink/32},   n = 0, ..., 31.   (12)

The window size of 32 frames is chosen to be close to a typical human gait cycle. Future work should also investigate an adaptive window size based on the actual gait cycle period of each person.

The DFTs reveal periodicities in the feature data as well as the relative strengths of any periodic components. Since the zeroth-order frequency component provides no information on the periodicity of the signal, while the high frequency components mainly capture noise, we sample the magnitude and phase of the second to fifth lowest frequency components. This leads to a feature vector containing 4 magnitude and 4 phase measures for each of the four space domain base features (S_1, ..., S_4), for an overall dimension of 32.
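A sketch of this feature computation with NumPy's FFT; the generator below assumes the four space domain signals are given as equal-length arrays.

import numpy as np

def dft_features(s1, s2, s3, s4, window=32):
    """Yield one 32-dimensional feature vector per sliding-window position."""
    signals = np.stack([s1, s2, s3, s4])         # shape (4, T)
    T = signals.shape[1]
    for start in range(T - window + 1):
        S = np.fft.fft(signals[:, start:start + window], axis=1) / window
        low = S[:, 1:5]                          # second to fifth lowest components
        vec = np.concatenate([np.abs(low).ravel(), np.angle(low).ravel()])
        yield vec                                # 16 magnitudes + 16 phases = 32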
5.2. Recognition method
After computing the features for each of the gait data samples, we segment the resulting data stream according to its gait cycles, such that any single example contains only the data from a single gait cycle. In this way, recognition is analogous to isolated-word speech recognition. Therefore, we apply Hidden Markov models (HMMs) for identification, which have been used successfully in speech recognition.

We consider an HMM of order one, where the current state depends only on the previous state. The observations for the HMM are the 32-dimensional feature vectors described above.
Fig. 7. (a) Sample image from the CMU MoBo data set. (b) Result of fitting the biped model.
Fig. 8. Tracking results for one subject.
The HMM is represented as (π, A, B). In the training step, the initial state distribution π, the transition probabilities A, and the observation probabilities B are estimated using the standard Baum–Welch re-estimation method [26].
Given a test example, we compute the likelihood of each HMM on the example, and choose the HMM with the highest likelihood as the correct one, i.e. we label it as k* such that k* = arg max_k p_k(O | π, A, B).
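A sketch of the training and identification steps using the hmmlearn package as a stand-in for the paper's HMM implementation; the state count and covariance type are illustrative choices.

import numpy as np
from hmmlearn import hmm

def train_models(train_data, n_states=5):
    """train_data: {subject_id: list of (cycle_length x 32) feature arrays}."""
    models = {}
    for subject, cycles in train_data.items():
        X = np.vstack(cycles)
        lengths = [len(c) for c in cycles]
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
        m.fit(X, lengths)                 # Baum-Welch re-estimation
        models[subject] = m
    return models

def identify(models, probe_cycle):
    """Label the probe with k* = argmax_k p_k(O | pi, A, B)."""
    return max(models, key=lambda k: models[k].score(probe_cycle))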
With multiple gait cycles for the same person, we can recognize each gait cycle individually using the above method. To combine the recognition results, we aggregate the N-best recognition results: for each gait cycle, we assign a score of 20 to the first-ranked hypothesis, 19 to the second, and so on; we then sum the rank scores over all gait cycles for each hypothesis, and pick the hypothesis with the highest cumulative score. Performing aggregation in this way yields an improved identification rate.
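This aggregation can be sketched as follows, reusing the per-subject models from the previous sketch.

def aggregate(models, probe_cycles, n_best=20):
    """N-best rank-score aggregation over all gait cycles of one probe."""
    scores = {k: 0 for k in models}
    for cycle in probe_cycles:
        loglik = {k: m.score(cycle) for k, m in models.items()}
        ranked = sorted(loglik, key=loglik.get, reverse=True)[:n_best]
        for rank, k in enumerate(ranked):
            scores[k] += n_best - rank    # 20 for rank 1, 19 for rank 2, ...
        # Hypotheses outside the top N receive no score for this cycle.
    return max(scores, key=scores.get)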
6. Experiments and results

The above algorithm is applied to both the CMU MoBo data set [10] and the USF Gait Challenge data set [25].

6.1. CMU MoBo data set

The CMU data set contains video sequences of 25 individuals with 824 gait cycles, walking on a treadmill under four different conditions: slow, fast, incline, or with a ball in hand. Figs. 7 and 8 show the tracking results on sample images.

In our first experiment, we split the gait cycles randomly into a training and a test set by a ratio of 3:1, so that both sets contain a mix of examples from all four walking activities. The cumulative match score is plotted as a function of rank in Fig. 9. We achieve a 96% identification accuracy (rank = 1), and the correct identification always occurs within the top 3 ranks.

Fig. 9. CMS plot of CMU gait data.

We also carried out the following experiments on this data set:

1. Train with slow walk and test with slow walk.
2. Train with fast walk and test with fast walk.
3. Train with incline walk and test with incline walk.
4. Train with walking while holding a ball and test with walking while holding a ball.
5. Train with slow walk and test with walking while holding a ball.

In the first four experiments, the sequences are divided into training and testing sets by a ratio of 4:1. In case (5), the entire slow walk sequences are used for training, and only one gait cycle of walking while holding a ball is used for evaluation.
Table 1
Identification rate P_I at ranks 1, 2, 5 and 10 on the CMU MoBo data set for various experiments

Experiment   Train set      Test set       P_I (%)
                                           Rank = 1   2      5      10
1            Slow           Slow           100        100    100    100
2            Fast           Fast           96.0       100    100    100
3            Incline        Incline        95.8       100    100    100
4            Holding ball   Holding ball   100        100    100    100
5            Slow           Holding ball   52.2       60.9   69.6   91.3
The results of the five experiments are summarized in Table 1. As we can see, the recognition rate hits 100% at the top match for experiments (1) and (4), and is nearly perfect (around 96%) for (2) and (3). This shows that our method performs better than shape-based approaches such as [16], suggesting that the motion dynamics of different subjects are accurately captured by our five-link biped human model.

However, our recognition result for experiment (5) is significantly worse than those for (1)–(4). A common observation from physiological studies is that human locomotion must satisfy postural stability and dynamic equilibrium. Hence, a subject slightly changes his or her gait when holding a ball, due to the adjustment necessary to balance the additional weight, and due to the restriction of arm movement. Consequently, the poor recognition rate in this case is a natural outcome for methods using only dynamic information. The higher recognition rates achieved by other recognition methods are an indication that human shape information is employed in addition to gait.
6.2. USF Gait Challenge data set
The USF Gait Challenge data set contains people walking in a natural outdoor setting. Since the sequences are taken outdoors, we need to handle special background changes due to shadows, moving background, lighting changes, etc. Therefore, we applied non-parametric background modelling [8] for silhouette extraction. Fig. 11 shows typical images from this data set and the corresponding silhouette images, whereas Fig. 12 shows regions of the silhouette images containing the subject. Due to the color similarity between the subject and the outdoor scene, the silhouette images are not smooth and may be discontinuous. Therefore, only region-based approaches can provide reliable feature extraction results.

Fig. 10. CMS plot of USF gait data.
Although the subjects walk along an elliptical track on the ground, we only pick the sequences where they are walking parallel to the camera. Typically, each gait data sample in this data set contains 4–7 individual gait cycles. Overall, there are 75 subjects in the data set, with a total of 2045 gait cycles. 75% of the cycles are randomly selected to form the training set, with the rest forming the test set. Both sets contain a mix of examples from subjects with different camera views, types of shoes, and surfaces. The identification rate (rank = 1) for the entire USF data set is 61%, as shown in Fig. 10.
Fig. 11. (a) Original images. (b) Silhouette images.
Fig. 12. Sample silhouettes from the USF data.
The recognition rate is lower than that for the CMU data set, which may be attributed to either the number of subjects in the data set or the distance between the subjects and the camera. We randomly choose a certain number of subjects from the USF data set and use their data for the recognition process. The process is repeated 20 times for data set sizes varying from 15 to 40. The average recognition rate and the standard deviation are shown in Fig. 13 as a function of the data set size. Clearly, increasing the data set size leads to a decrease in the recognition rate. Using 25 subjects, as in the CMU MoBo data set, a recognition rate of (77±5)% is obtained for the USF data set. Although this represents a significant improvement over the 61% obtained when all 75 subjects are used, it does not completely account for the lower recognition rate on the USF data set. We note that the average image length of the thighs of the subjects here is 26.7 pixels, compared to ~130 pixels in the CMU data set. Therefore, the accuracy of the extracted features is limited in the USF data set. The subtle inter-subject movement differences cannot be fully extracted from these images, which results in a lower recognition rate. Both the number of subjects and the image resolution are hence important factors affecting the recognition rate.
7. Conclusion
In this paper, we have presented a novel two-step, model-based approach to gait recognition using exclusively the human body movement information within the sagittal plane. As the
concealable shape and appearance information is avoided, robust recognition is achieved using our method. Applying this approach to the CMU MoBo data set and the USF Gait Challenge data set, we achieve recognition rates of 96% and 61%, respectively. The lower recognition rate for the USF data set is attributed to both the larger number of subjects and the longer distance from the camera to the subjects. This suggests that proper zoom lenses are needed to ensure that the gait motion is seen in sufficient detail.

The experimental results demonstrate that the sagittal plane contains identification information. Other viewpoints may also contain important gait features. For example, images captured from a camera facing the subject reveal additional swaying and toe-in/toe-out information, which may also be useful in recognition. In our future work, we would combine the frontal view of the subject for other information, such as the toe-out and the bending of the legs [29], to further improve the recognition rate.

Fig. 13. The change in recognition rate (at rank 1) according to the size of the data set.

Acknowledgements

The authors would like to thank Shan Lu for fruitful discussions, and Stratos Loukidis for setting up the database. This work is supported by the National Science Foundation, contract numbers NSF-ITR-0205671, NSF-ITR-0313184, and NSF-0200983.

References
[1] S. Ayer, H.S. Sawhney, Layered representation of motion video using
robust maximum-likelihood estimation of mixture models and MDL
encoding, in: IEEE International Conference on Computer Vision, 1995,
pp. 777–784.
[2] C. BenAbdelkader, R. Cutler, H. Nanda, L.S. Davis, Eigengait:
motion-based recognition of people using image self-similarity, in:
Proceedings of the International Conference on Audio and Video-based
Person Authentication, 2001.
[3] A. Borghese, L. Bianchi, F. Lacquaniti, Kinematic determinants of human
locomotion, Journal of Physiology 494 (3) (1996) 863–879.
[4] R. Chellappa, C. Wilson, S. Sirohev, Human and machine recognition of
faces: a survey, Proceedings of IEEE 83 (5) (1995) 705–740.
[5] R.T. Collins, R. Gross, J. Shi, Silhouette-based human identification from
body shape and gait, in: International Conference on Automatic Face and
Gesture Recognition, 2002.
[6] D. Cunado, M.S. Nixon, J.N. Carter, Using gait as a biometric, via phase-weighted magnitude spectra, in: First International Conference on Audio and Video based Biometric Person Authentication, 1997.
[7] J.G. Daugman, High confidence visual recognition of persons by a test
of statistical independence, IEEE Transactions on Pattern Analysis and
Machine Intelligence 15 (11) (1993) 1148–1161.
[8] A. Elgammal, D. Harwood, L. Davis, Non-parametric model for
background subtraction, in: Sixth European Conference on Computer
Vision, 2000.
[9] J.P. Foster, M.S. Nixon, A. Prügel-Bennett, Automatic gait recognition using area-based metrics, Pattern Recognition Letters 24 (14) (2003) 2489–2497.
[10] R. Gross, J. Shi, The CMU motion of body (MoBo) database, Technical
Report CMU-RI-TR-01-18, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, June 2001.
[11] H. Murase, R. Sakai, Moving object recognition in eigenspace
representation: gait analysis and lip reading, Pattern Recognition Letters
17 (2) (1996) 155–162.
[12] J.B. Hayfron-Acquah, M.S. Nixon, J.N. Carter, Automatic gait
recognition by symmetry analysis, Pattern Recognition Letters 24 (13)
(2003) 2175–2183.
[13] B.K.P. Horn, Robot Vision, MIT Press, Cambridge, MA, 1986.
[14] P.S. Huang, C.J. Harris, M.S. Nixon, Human gait recognition in canonical space using temporal templates, IEE Proceedings—Vision, Image and Signal Processing 146 (2) (1999) 93–100.
[15] G. Johansson, Visual perception of biological motion and a model for its
analysis, Perception and Psychophysics 14 (2) (1973) 201–211.
[16] A. Kale, N. Cuntoor, B. Yegnanarayana, A.N. Rajagopalan, R. Chellappa,
Gait analysis for human identification, in: Proceedings of the Third
International Conference on Audio and Video Based Person Authentication, 2003.
[17] A. Kale, A.N. Rajagopalan, N. Cuntoor, V. Kruger, Gait based
recognition of humans using continuous HMMs, in: Face and Gesture
Recognition, 2002.
[18] K. Karu, A.K. Jain, Fingerprint classification, Pattern Recognition 29 (3)
(1996) 389–404.
[19] C.S. Lee, A. Elgammal, Gait style and gait content: bilinear model for gait
recognition using gait re-sampling, in: Sixth International Conference on
Automatic Face and Gesture Recognition, 2004.
[20] L. Lee, G. Dalley, K. Tieu, Learning pedestrian models for silhouette
refinement, in: International Conference on Computer Vision and Pattern
Recognition, 2003.
[21] L. Lee, W.E.L. Grimson, Gait analysis for recognition and classification, in:
IEEE Conference on Face and Gesture Recognition, 2002, pp. 155–161.
[22] J. Little, J. Boyd, Recognizing people by their gait: the shape of motion, Videre 1 (2) (1998) 1–32.
[23] M.P. Murray, A.B. Drought, R.C. Kory, Walking patterns of normal men,
Journal of Bone and Joint Surgery 46-A (2) (1964) 335–360.
[24] S.A. Niyogi, E.H. Adelson, Analyzing and recognizing walking figures in
xyt, in: IEEE Conference on Computer Vision and Pattern Recognition,
1994.
[25] P.J. Phillips, S. Sarkar, I. Robledo, P. Grother, K. Bowyer, The gait
identification challenge problem: data sets and baseline algorithm, in:
International Conference on Pattern Recognition, 2002.
[26] L.R. Rabiner, A tutorial on hidden Markov models and selected
applications in speech recognition, Proceedings of the IEEE 77 (2)
(1989) 257–286.
[27] C. Sminchisescu, B. Triggs, Covariance scaled sampling for monocular
3d body tracking, in: IEEE International Conference on Computer Vision
and Pattern Recognition, 2001.
[28] S.V. Stevenage, M.S. Nixon, K. Vince, Visual analysis of gait as a cue to
identity, Applied Cognitive Psychology 13 (1999) 513–526.
[29] H. Sun, Curved Path Human Locomotion on Uneven Terrain, PhD thesis,
University of Pennsylvania, 2000.
[30] A. Sundaresan, A. RoyChowdhury, R. Chellappa, A hidden Markov
model based framework for recognition of humans from gait sequences,
in: International Conference on Image Processing, 2003.
[31] R. Tanawongsuwan, A. Bobick, Gait recognition from time-normalized
joint-angle trajectories in the walking plane, in: CVPR, vol. 2, 2001, pp.
726–731.
[32] A.R. Tilley (Ed.), The Measure of Man and Woman: Human Factors in
Design, H.D. Associates, New York, 1993.
[33] N.F. Troje, Decomposing biological motion: a framework for analysis and
synthesis of human gait patterns, Journal of Vision 2 (2002) 371–387.
[34] C.Y. Yam, M.S. Nixon, J.N. Carter, On the relationship of human walking
and running: automatic person identification by gait, in: International
Conference on Pattern Recognition, 2002.
[35] J.H. Yoo, M.S. Nixon, C.J. Harris, Extracting human gait signatures by
body segment properties, in: Southwest Symposium on Image Analysis
and Interpretation, 2002, pp. 35–39.