Gait-based Recognition of Humans Using Continuous HMMs
A. Kale1, A.N. Rajagopalan2, N. Cuntoor1 and V. Krüger1
1 Center for Automation Research
University of Maryland at College Park
College Park, MD 20742
2 Department of Electrical Engineering
Indian Institute of Technology Madras
Chennai-600 036, India
Abstract
Gait is a spatio-temporal phenomenon that typifies the
motion characteristics of an individual. In this paper,
we propose a view-based approach to recognize humans
through gait. The width of the outer contour of the binarized
silhouette of a walking person is chosen as the image feature.
A set of stances or key frames that occur during the walk
cycle of an individual is chosen. Euclidean distances of a
given image from this stance set are computed and a lower
dimensional observation vector is generated. A continuous
HMM is trained using several such lower dimensional vector sequences extracted from the video. This methodology
serves to compactly capture structural and transitional features that are unique to an individual. The statistical nature
of the HMM renders overall robustness to gait representation and recognition. Human identification performance of
the proposed scheme is found to be quite good when tested
in natural walk conditions.
1 Introduction
The need for automated person identification is growing in many applications such as surveillance, access control and smart interfaces. It is well known that biometrics are a powerful tool for reliable automated person identification. Established biometric-based identification techniques range from fingerprint and hand geometry methods to schemes like face recognition and iris identification. However, these methodologies are either intrusive or restricted to very controlled environments. For example, current face recognition technology is capable of recognizing only frontal or nearly frontal faces. When the problem of person identification is attempted in natural settings, such as those that occur in the automatic surveillance of people in strategic areas, it takes on a new dimension. Biometrics such as fingerprint or iris are then no longer applicable. Furthermore, night vision capability (an important component in surveillance) is usually not possible with these biometrics. A biometric that can address some of these shortcomings is the human 'gait'. The attraction of using gait as a biometric is that it is non-intrusive and typifies the motion characteristics specific to an individual. It is a well-known fact that people often recognize others by simply observing their gait, which may justify using it as a cue for recognizing people from a small database. However, if the database is large, then gait information, by itself, may not be sufficient to discriminate each individual. But it still makes good sense to use gait as an indexing tool to greatly narrow down the search for potential targets. Early medical studies [7] suggest that if all movements are considered, gait is unique. In all, it appears that there are 24 different components to human gait. However, from a computational perspective, it is quite difficult to accurately extract these components. Precise extraction of body parts and joint angles in real visual imagery is a very cumbersome task and can be unreliable. Hence, the problem of representing and recognizing gait turns out to be a challenging one. A careful analysis of gait reveals that it has two important components: a structural component, which captures the physical build of a person, and a dynamic component, which captures the transitions that the body undergoes during a walk cycle. If one could effectively capture these components, then it should be possible to recognize gait.
In this paper, we present a method that directly incorporates the structural and transitional knowledge about the identity of the person performing the activity. This knowledge is used to generate a lower dimensional observation vector sequence which is then used to design a continuous density HMM for each individual. In the next section, we give an overview of the prior work in the area of activity and gait recognition. Section 3 motivates our approach. Section 4 describes our methodology for human recognition using gait. Section 5 describes our experimental results and Section 6 concludes the paper.

Supported by the DARPA/ONR grant N00014-00-1-0908.
2 Prior Work
The task of recognizing people by the way they walk is
an instance of the more general problem of recognition of
humans from gesture or activity. We take a closer look at
the relation between the problems of activity recognition
and activity-specific person identification. A good review
of the state of the art in activity recognition can be found in
[1]. For human activity or behavior recognition, most efforts have used HMM-based approaches [11, 12, 13] as opposed to template matching, which is sensitive to noise and to variations in movement duration. In [13], discrete HMMs
are used to recognize different tennis strokes. In [11], continuous HMMs are used to recognize American sign language. In [12] a parametric continuous HMM has been applied for activity recognition. All these approaches involve
picking a lower dimensional feature vector from an image and using these vectors to train an HMM. Note that, if we choose a reasonable set of features, the trajectories corresponding to distinct activities will be far apart in the feature space. Hence, in principle, with a small degradation in performance, it is possible to replace the continuous approaches in [11, 12] by building a codeword set through k-means clustering over the lower dimensional observation vectors and using a discrete HMM approach as in [13]. The scenario is very different in the
problem of recognition of humans from activity. Primarily, there is considerable similarity in the way people perform an activity. Hence, feature trajectories corresponding
to different individuals performing the same activity tend to
be much closer to one another as compared to feature trajectories corresponding to distinct activities. The aforementioned activity recognition approaches, if directly applied to
human identification using gait, will almost certainly fail in
the presence of noise and structurally similar individuals in
the database.
We now review some of the prior work done in the recognition of humans from gait. In [4], Huang et al. use optical
flow to derive the motion image sequence corresponding to
a gait cycle. The approach is sensitive to optical flow computation. Also, it does not address the issue of phase in a
gait cycle. In another approach, Cunado et al. [3] extract a gait signature by fitting the movement of the thighs to an articulated pendulum-like motion model. The idea is somewhat similar to the work by Murray [7], who modeled the hip
rotation angle as a simple pendulum, the motion of which
was approximately described by simple harmonic motion.
Accurately locating the thigh in real image sequences can,
however, be very difficult. Little and Boyd [5] extracted
frequency and phase features from moments of the motion
image derived from optical flow to recognize different people by their gait. As expected, the method is quite sensitive to the feature extraction process. Bobick and Johnson
used static features for recognition of humans using gait [2].
Murase and Sakai [6] have also proposed a template matching method which is somewhat similar in spirit to the work
reported in [4].
3 Our Approach
One of the issues that arises in the context of gait recognition is the viewing angle and invariance thereto. It is reasonable to choose the viewing angle that yields the maximum observable dynamics, since the structural gait information does not change significantly with viewing angle. We therefore analyze the side view of a
person walking, allowing for minor angular variations.
A possible solution to gait representation/recognition lies
in a closer examination of the physical process of gait generation. During a gait cycle, it is possible to identify certain distinct stances (Figure 1) that are generic, in the sense
that every person transits between these successive stances
as he/she walks. These stances partly encode identity information by virtue of the structural differences between people. In practice, an accurate time-stamping of these stances
is impossible. A precise demarcation of when the image
undergoes a transition from one stance to another is difficult. Hence, using structural information alone may not
yield good discriminability.
There is a Markovian dependence from one stance to another. The gait cycle can be viewed as a doubly stochastic process in which the hidden process is represented by
the transitions across the stances while the observable is the
image generated when in a particular stance. The HMM
is best suited for describing such a situation. Formally, a
HMM is defined as a doubly stochastic process that is not
directly observed but can only be studied through another
set of stochastic processes that produce the given sequence
of observations. Markovian transitions are assumed to occur
between states and a random observation is output in a particular state. For a detailed discussion on HMMs and their
applications see [9].
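In the notation introduced in Section 4, this doubly stochastic structure amounts to a hidden stance sequence $q_t$ that evolves with Markovian transitions while each frame's observation is drawn from a state-dependent density (a compact restatement of the model above, not an extension of it):
$$ P(q_t = j \mid q_{t-1} = i) = a_{ij}, \qquad o_t \sim b_{q_t}(\cdot), \qquad P(q_1 = i) = \pi_i. $$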
As described earlier, the gait cycle consists of distinct
stances. It is our conjecture that these stances can be associated with the states of an HMM where the switch from one
stance to another can be represented by transition probabilities between states.
4 Proposed Methodology
An important issue is the extraction of appropriate features that will capture the gait characteristics effectively. Intuitively, the silhouette of a person is a reasonable feature to
look at as it captures the motion of most of the body parts and
also encodes structural as well as transitional information. It
is reasonably independent of the clothing worn by the person and it supports night vision capability as it can be easily
derived from IR imagery. Successful training of the HMM
depends largely on the dimension of the observation vector.
Clearly, the silhouette information cannot be used as is due to its large dimension. Compact encoding of the information contained in the silhouette is necessary for good performance. We now describe a procedure to efficiently encode
this information. This is followed by a detailed description
of training and evaluation modules.
Figure 1. Five stances corresponding to the gait cycle of (a) Person 1 and (b) Person 2.

4.1 Silhouette Extraction
In our experiments, it is assumed that the camera is static and that only one person is within the field of view.
Given the image sequence of a subject, the silhouette is
generated as follows:
1. Background subtraction is used to detect moving objects in each frame; subsequently a blob tracker tracks
the fastest moving object in the scene, thereby reducing effects of minor disturbances in the background.
2. A standard 3 × 3 erosion filter is applied to the motion
image to remove spurious noise.
3. Since we are interested in the outer contour of the
body only, the left and right boundaries of the body
are traced by examining the pixel intensities with a
weighted low pass filter from leftmost and rightmost
ends of the image.
4. The width of the silhouette along each row of the image is then stored. The width along a given row is
simply the difference in the locations of rightmost and
leftmost boundary pixels in that row.
Typical silhouette images extracted from a video sequence
are shown in Figure 1. It may be noted that our silhouette extraction procedure is simple and straightforward. It is quite
possible that the silhouette is sometimes not perfectly extracted. However, the advantage of using a statistical approach (such as the HMM) is that it is robust to such minor
perturbations.
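As a rough illustration of steps 1-4 above (not the authors' implementation), the following Python/NumPy sketch computes the width vector from a binarized motion mask; the blob tracking and the weighted low-pass filtering of the boundary in step 3 are omitted for brevity, and all names are ours:

```python
import numpy as np
from scipy.ndimage import binary_erosion

def width_vector(motion_mask):
    """Width of the silhouette along each row of a binary motion image.

    motion_mask: 2-D boolean array, True where background subtraction
    (and blob tracking) flagged the walking person.
    Returns one width value per image row.
    """
    # Step 2: a 3x3 erosion removes spurious single-pixel noise.
    clean = binary_erosion(motion_mask, structure=np.ones((3, 3)))

    rows = clean.shape[0]
    widths = np.zeros(rows, dtype=int)
    for r in range(rows):
        on = np.flatnonzero(clean[r])        # silhouette pixels in this row
        if on.size:
            # Steps 3-4: only the outer contour matters, so the width is the
            # distance between the leftmost and rightmost boundary pixels.
            widths[r] = on[-1] - on[0]
    return widths
```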
4.2 Training for Gait
Given an observation sequence, we seek a way to build
a representation for the gait of every individual in the
database. As discussed before, we opt for the stochastic approach of using continuous HMMs. In this case, training involves learning the HMM parameters $\lambda = (A, B, \pi)$ from the observation sequences. Here $A$ denotes the transition probability matrix, $B$ is the observation probability, and $\pi$ is the initial probability vector. In order to capture the gait
of an individual, we train the HMM using the width vectors
derived from the silhouette for several gait cycles of the person. We express the pdf of the observation as
$$ b_j(o) = \mathcal{N}(o; \mu_j, U_j), \quad 1 \le j \le N \qquad (1) $$
where $o$ is the observation vector, $N$ is the number of states in the HMM, and $\mu_j$ and $U_j$ are the mean and covariance, respectively. The reliability of the estimates of $B$ depends on the number of training samples available and the dimension of the observation vector.
In a practical situation, only a finite amount of training
data is available. Since the means and covariances in equation (1) have to be learnt from the training samples, the dimension of the observation vector becomes critical. The required number of training samples increases with the dimensionality of the observation vector. To be precise, assume
for the moment that the data can be modeled by a single
Gaussian distribution. Then, for a $d$-dimensional observation vector, we need at least $d$ training samples to estimate the centroid and $d(d+1)/2$ training samples in order that the covariance matrix have a well-defined inverse. In our
experiments, the smallest dimension of the width vector of
the silhouette is approximately 100. This implies that we require at least 100 observations to learn the mean value. To
learn the covariance, we would need as many as 5150 vectors! For a mixture of m-Gaussian model, there would be
a further m-fold increase in the number of training vectors.
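As a quick check of these counts for $d = 100$:
$$ \underbrace{d}_{\text{mean}} + \underbrace{\frac{d(d+1)}{2}}_{\text{covariance}} = 100 + \frac{100 \cdot 101}{2} = 100 + 5050 = 5150. $$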
Clearly, the possibility of using the width vector directly is
ruled out. A more compact way of encoding the observation, while retaining all relevant information, is needed. We
propose the following methodology to tackle the dimensionality issue in the gait problem. To decide on the number
of stances we plot the average rate distortion curve of the
quantization error as a function of the number of stances. From Figure 2 we observe that the quantization error does not decrease appreciably beyond 5 stances. Let us denote the width vectors corresponding to the five stances for the $j$th person as $S_1^j, \ldots, S_5^j$. These stances are the ones that result
from application of the k-means procedure to the training
data available for that individual. The Euclidean distance
between an observed width vector in frame $k$ (denoted by $OW^j(k)$) and the $l$th stance is given by
$$ \| OW^j(k) - S_l^j \| \qquad (2) $$
Figure 2. Average rate distortion curve for the UMD database (distortion vs. number of centroids).
$O_j^i(k)$ represents the observation sequence of person $i$ encoded in terms of the stances of person $j$. Note that the dimension of $O_j^j(k)$ is only 5. The new 5-D vector, which is a measure of the similarity between the observed image and the five stances, has the following significance:
• Firstly, note that by virtue of self-similarity, the encoding of a width vector of person $j$ in terms of the width vectors of the stances of person $j$ will yield a lower Euclidean distance than when it is encoded in terms of the width vectors of person $i$. For instance, the five-dimensional vector for a short person generated using the stances of a tall person will be large in magnitude.
• In addition, the manner in which every component of this vector evolves with time encodes the transitional information unique to a person. This transitional information could be the key factor that distinguishes between two individuals who are structurally similar.
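A minimal sketch of this encoding (Python with NumPy and scikit-learn, a library choice of ours rather than the paper's): the five stances are obtained as k-means centroids of a person's training width vectors, and each frame is then replaced by its Euclidean distances to those centroids, assuming all width vectors share the same dimension $d$.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_stances(train_widths, n_stances=5):
    """k-means centroids of the training width vectors play the role
    of the stances S_1, ..., S_5 for one person.

    train_widths: array of shape (num_frames, d).
    """
    km = KMeans(n_clusters=n_stances, n_init=10, random_state=0)
    km.fit(train_widths)
    return km.cluster_centers_               # shape (n_stances, d)

def encode(widths, stances):
    """Replace each d-dimensional width vector by its Euclidean distances
    to the five stances, giving the 5-D observation sequence of Eq. (2)."""
    diff = widths[:, None, :] - stances[None, :, :]
    return np.linalg.norm(diff, axis=2)      # shape (num_frames, n_stances)
```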
4.3 Gait Recognition
The stances together with the HMM represent the gait of
an individual. For robust recognition, it is reasonable to examine several walk cycles before taking a decision, i.e., instead of looking at a single half walk cycle, it is beneficial to examine multiple half-cycles of a person before any conclusion about his/her gait can be reached. We
assume that several walk cycles of an individual are available. The problem is to recognize this individual from a
database of people whose gait models are available.
Given the image sequence of the unknown person $X$, the width vector $OW^X$ of this person is generated. Using the stances $S_1^i, \ldots, S_5^i$ for person $i$ in the database, we compute the Euclidean distance of the width vector $OW^X$ of the unknown person w.r.t. the stances of person $i$ to yield $O_i^X(k)$ for the $k$th frame. The likelihood that the observation sequence $O_i^X$ was generated by the HMM corresponding to the $i$th person can be computed using the forward algorithm as
$$ P_i = \log P(O_i^X \mid \lambda_i) \qquad (3) $$
where $\lambda_i$ is the HMM model corresponding to person $i$. We repeat the above procedure for every person in the database, thereby producing probabilities $P_j$, $1 \le j \le N$. Suppose that the unknown person was actually person $m$. If the values of $P_1, \ldots, P_N$ are observed for a sufficient number of half cycles of person $X$, we expect that in a majority of cases $P_m$ would be higher than the rest of the $P_i$'s. We shall present our results in a format similar to the FERET protocol [8].
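To make the training and recognition loop concrete, here is a hedged sketch built on the hmmlearn package (which the paper does not use): a 5-state, single-Gaussian HMM is fit per person on the encoded half cycles, Eq. (3) is evaluated through the model's forward-algorithm score, and the per-cycle decisions are combined by a simple majority vote. All function and variable names are ours.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_person_model(encoded_cycles):
    """encoded_cycles: list of (num_frames, 5) arrays, one per half cycle."""
    X = np.vstack(encoded_cycles)
    lengths = [len(c) for c in encoded_cycles]
    hmm = GaussianHMM(n_components=5, covariance_type="full", n_iter=100)
    hmm.fit(X, lengths)                      # Baum-Welch re-estimation
    return hmm

def identify(test_width_cycles, gallery):
    """gallery: list of (stances_i, hmm_i) pairs, one per enrolled person.
    Each test half cycle votes for the person with the highest log-likelihood."""
    votes = np.zeros(len(gallery), dtype=int)
    for widths in test_width_cycles:         # widths: (num_frames, d)
        scores = []
        for stances_i, hmm_i in gallery:
            # Encode the unknown sequence w.r.t. person i's stances.
            obs = np.linalg.norm(widths[:, None, :] - stances_i[None, :, :],
                                 axis=2)
            scores.append(hmm_i.score(obs))  # log P(O | lambda_i), Eq. (3)
        votes[int(np.argmax(scores))] += 1
    return int(np.argmax(votes))             # identity winning the vote
```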
5 Experimental Results
For our experiments, the video sequences were taken
from the following databases:
1. Little and Boyd’s database [5]. This has 5 people with
around 22 walk cycles for each subject. Half of the
cycles were used for training and the other half for
testing. The data was collected by a camera mounted
on a tripod. The number of pixels on target was about
100.
2. University of Maryland (UMD) database: It has 43
people walking in a T-shaped path. The data was collected by a surveillance camera mounted at a height of
15 ft. It has two sequences collected on different days for each person, each with 10 cycles. One sequence
was used for training and the other for evaluation. The
number of pixels on target was about 150.
3. Carnegie Mellon University (CMU) database: It has
25 people walking at a fast pace and slow pace on a
treadmill. There are about 16 cycles in each sequence.
Half of the cycles were used for training and the other
half for testing. The data was collected by a camera
mounted on a tripod. The number of pixels on target
was about 630.
It should be pointed out here that a walk cycle consists of two strides (or half cycles). Since we use only the extremities of the silhouette, the two halves of the walk cycle are almost indistinguishable. The cycles we mention for the
databases are really the half cycles.
Training:
Silhouettes corresponding to a walk cycle are extracted for
each person in the database using the silhouette extraction
procedure described in Section 4.1. The width vector is generated for each frame and encoded as a compact 5-D observation sequence using the stances of that person. This lower
dimensional vector sequence (possibly of varying length)
constitutes a training sequence. We train a 5-state, single
Gaussian, ergodic HMM for each person. As expected,
the transition probabilities and the observation probabilities turned out to be different for different people. We use the holdout method for error estimation.

Table 1. Confusion matrix for the Little and Boyd database (5-state HMM).
Table 2. Confusion matrix for the 3-state HMM.
Table 3. Confusion matrix for the 8-state HMM.
Recognition:
Given the gait cycles of an unknown person $X$ and HMM models $\lambda_i$ and stances for person $i$, $1 \le i \le N$, we compute the 5-dimensional vector $O_i^X$. Using the Viterbi algorithm, we compute (3). The above procedure is repeated
with respect to each person in the database. We rank order
the person indices in descending order of the posterior probabilities. This procedure is repeated for several walk cycles.
We present our results in terms of a cumulative match score
(CMS) curve. It is also possible to give a confusion matrix
as shown in Table 1 for the Little and Boyd database.
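For completeness, a small sketch (names ours) of how the cumulative match score can be computed from the rank-ordered identities: CMS at rank r is the fraction of probe cycles whose true identity appears among the top r candidates.

```python
import numpy as np

def cumulative_match_scores(rank_lists, true_ids, max_rank):
    """rank_lists: for each probe, the person indices sorted by decreasing
    log-likelihood; true_ids: the correct index for each probe.
    Returns CMS(1), ..., CMS(max_rank) as fractions in [0, 1]."""
    hits = np.zeros(max_rank)
    for ranks, truth in zip(rank_lists, true_ids):
        position = ranks.index(truth) + 1    # 1-based rank of the true identity
        if position <= max_rank:
            hits[position - 1:] += 1         # counted at this rank and beyond
    return hits / len(true_ids)
```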
In the Little and Boyd database, Persons 2, 3 and 4
have similar structural characteristics and expectedly the
false alarms are also somewhat predominant for these three subjects. The recognition results for the UMD and CMU databases are shown in Figures 3a and 3b, respectively. The
result for the UMD database reveals that the performance of
the method does not degrade significantly with an increase
in the database size. However, the slight drop in performance is due to drastic changes in the clothing of some subjects and changes in illumination (causing very noisy binarized silhouettes). It is natural for a person to change his
speed of walking with time. The use of HMM enables us
to deal with this variability without explicit time normalization. However, for certain individuals, and as biomechanics also suggests, there is a considerable change in body dynamics as a person changes his speed, which explains the slight drop in recognition rates. Observe that for the CMU database, when the HMM is trained using cycles from the slow walk and tested using cycles from the fast walk, the result is poorer than when the training and testing scenarios are reversed. This is because, with an increased number of frames per cycle, the A matrix tends towards diagonal dominance on account of an increased number of self-loops. This suggests that explicit state duration modeling may be of interest [10].
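One way to see this, using the standard geometric state-duration model of an HMM (see [9]): if $a_{ii}$ is the self-transition probability of state $i$, then
$$ P(d \text{ consecutive frames in state } i) = a_{ii}^{\,d-1}(1 - a_{ii}), \qquad \mathbb{E}[d] = \frac{1}{1 - a_{ii}}, $$
so the longer (in frames) each stance lasts in the slow walk, the closer the diagonal entries of $A$ are pushed towards 1.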
The issue of the number of states deserves special attention. The choice of the number of states in an HMM model is
always a tricky issue. The states of an HMM can be abstract
quantities and it is not necessary that they must correspond
to physical features of the underlying process. However, it
would definitely be interesting, if the physical phenomenon
can guide the choice of the number of states in an HMM. In
Section 3, we had conjectured that the states and stances in
a gait cycle are likely to be related.
We studied the performance of our gait recognizer as a
function of the number of states. We experimented with 3,
5 and 8-state HMMs. The results obtained for the Little and
Boyd database are reported. The worst-case results for 3
and 8-state HMMs are given in Tables 2 and 3. From the
tables we note that there is a considerable reduction in accuracy as compared to the 5-state case. The optimal state
sequence obtained from the Viterbi algorithm revealed that
the transitions in the states occur approximately at the same
time instants that the shift in stances occurs in the observation sequence. On the other hand, the state sequences for the 3-state and 8-state models did not have a corresponding physical interpretation. Thus, it appears that a 5-state HMM is
best suited for our experiments, thereby confirming our conjecture relating the stances and the states.
6 Conclusion
In this paper, we have proposed an HMM-based approach
to represent and recognize gait. A methodology is adopted
to derive a low dimensional observation sequence from the
silhouette of the body during a gait cycle. Learning is
achieved by training an HMM for each person over several
gait cycles. Gait recognition is performed by evaluating the
log-probability that a given observation sequence was generated by an HMM model present in the database.
The method was tested on 3 different databases. In general, the recognition rates were found to be good. As anticipated, drastic changes in clothing adversely affect recognition performance. The method is sensitive to changes in
viewing angle beyond ten degrees. The method is reasonably robust to changes in speed. In the case of human gait
recognition we observed in some cases that the stride length
changed appreciably with walking speed causing a slight
drop in recognition performance. The method is however
not robust to drastic changes in the silhouettes which might
result due to changes in clothing or illumination.
Figure 3. Identification performance (cumulative match score vs. rank) for (a) the UMD database (# of cycles: train 10, test 10) and (b) the CMU database. The CMU plots compare: train on fast walk (8 cycles), test on fast walk (8 cycles); train on slow walk (8 cycles), test on slow walk (8 cycles); train on fast walk (8 cycles), test on slow walk (8 cycles); train on slow walk (8 cycles), test on fast walk (8 cycles).
Presently, we are looking at ways to make the scheme invariant to changes in viewing angle and scale that might occur due to
the use of multiple cameras. We are also exploring the use
of better image metrics to make the 5-D vector more informative. It should be stressed here that the scheme has the
potential to distinguish between humans and non-humans.
It can also be extended to classify different activities such
as walking and running. We are exploring the possibility of
activity independent person identification.
References
[1] J. Aggarwal and Q. Cai. Human motion analysis: a review.
Computer Vision and Image Understanding, 73(3):428–440,
March 1999.
[2] A. Bobick and A. Johnson. Gait recognition using static
activity-specific parameters. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (Lihue,
HI), December 2001.
[3] D. Cunado, J. Nash, M. Nixon, and J. N. Carter. Gait extraction and description by evidence-gathering. Proc. of the International Conference on Audio and Video Based Biometric
Person Authentication, pages 43–48, 1995.
[4] P. Huang, C. Harris, and M. Nixon. Recognizing humans by
gait via parametric canonical space. Artificial Intelligence in
Engineering, 13(4):359–366, October 1999.
[5] J. Little and J. Boyd. Recognizing people by their gait: the shape of motion. Videre, 1(2):1–32, 1998.
[6] H. Murase and R. Sakai. Moving object recognition in eigenspace representation: gait analysis and lip reading. Pattern Recognition Letters, 17:155–162, 1996.
[7] M. Murray, A. Drought, and R. Kory. Walking patterns
of normal men. Journal of Bone and Joint surgery, 46A(2):335–360, 1964.
[8] P. J. Phillips, H. Moon, and S. A. Rizvi. The FERET evaluation methodology for face-recognition algorithms. IEEE
Trans. on Pattern Anal. and Machine Intell., 22(10):1090–
1100, October 2000.
[9] L. Rabiner. A tutorial on hidden Markov models and selected
applications in speech recognition. Proceedings of the IEEE,
77(2):257–285, February 1989.
[10] M. Russell and R. K. Moore. Explicit modelling of state
occupancy in hidden Markov models for automatic speech
recognition. Proceedings of IEEE Conference on Acoustics
Speech and Signal Processing, June 1985.
[11] T. Starner, J. Weaver, and A. Pentland. Real-time American sign language recognition from video using HMMs. IEEE
Trans. on Pattern Anal. and Machine Intell., 12(8):1371–
1375, December 1998.
[12] D. Wilson and A. Bobick. Nonlinear PHMMs for the interpretation of parameterized gesture. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (Santa
Barbara, CA), June 1998.
[13] J. Yamato, J. Ohya, and L. Ishii. Recognizing human action in time-sequential images using hidden Markov model.
Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 624–630, 1995.