Hand posture analysis for visual-based human-machine interface
Abdolah Chalechale, Farzad Safaei
Smart Internet Technology CRC
University of Wollongong
Wollongong, NSW, 2522, Australia
{ac82,farzad}@uow.edu.au
Golshah Nagdy, Prashan Premaratne
School of Electrical, Computer and Telecom. Eng.
University of Wollongong
Wollongong, NSW, 2522, Australia
{golshah,prashan}@uow.edu.au
Abstract
This paper presents a new scheme for hand posture selection and recognition based on statistical classification. It
has applications in telemedicine, virtual reality, computer
games, and sign language studies. The focus is placed on (1) how to select an appropriate set of postures having a satisfactory level of discrimination power, and (2) how geometric and invariant moment-based properties compare for recognizing hand postures. We have introduced cluster-property and cluster-features matrices to ease posture selection and to evaluate different posture characteristics. Simple and fast decision functions are derived for classification, which expedite the on-line decision-making process. Experimental results confirm the efficacy of the proposed scheme, where a
compact set of geometric features yields a recognition rate
of 98.8%.
1. Introduction
Human-machine interface (HMI) has become an essential part of our technological revolution. It offers both consumers and providers enormous opportunities for expanded access. However, as with any burgeoning technological innovation, HMI still faces a wide array of open challenges.
More generally, virtual reality, as an artificially created interactive environment resembling real life, is attracting increasing attention among researchers. Furthermore, in many telemedicine applications such as remote patient care and smart home-based health-care devices, patients are monitored remotely. In such applications, ambient intelligence is integrated into monitoring devices such as cameras in order to measure patients' gestures and postures.
The technology for on-line interaction in all of the above applications over the Internet is maturing due to advances in communication tools and modern video transcoding expertise. Users usually interact with machines using a keyboard, mouse, joystick, trackball, or wired glove. Most of these are special devices that, by and large, are designed to suit the computer hardware rather than the human user. Humans, however, use gestures in daily life as a means of communication: hand shaking, head nodding, and hand gestures are widely used in friendly communication. Using machine vision algorithms, a computer can recognize the user's gesture/posture and perform the appropriate actions required in virtual reality environments or in computer and video games. This paper aims at the application of posture-based interaction in areas such as telemedicine, sign language recognition, virtual reality, and computer and video games.
Although several aspects of directing computers using human gestures/postures have been studied in the literature, gesture/posture recognition is still an open problem. This is due to significant challenges in response time, reliability, economic constraints, and natural intuitive gesticulation restrictions [9]. The MPEG-4 standard has defined Facial Animation Parameters to analyze facial expressions and convert them to predefined facial actions [6]. Principal component analysis has been used for hand posture recognition [2]. Jian et al. [8] have developed a lip tracking system using lip contour analysis and feature extraction. Similarly, human leg movement has been tracked using color marks placed on the shoes of the user, with the type of leg movement determined by a first-order Markov model [3].
A neural network-based computing system has been
used in [14] to extract motion qualities from a live performance. The inputs to the system are both 3D motion capture
(where position and orientation sensors collect data from
the whole body of the performer) and 2D video projections.
This system, which has been used in an extended project
at the Center for Human Modeling and Simulation, University of Pennsylvania, provides the capability of automating
both observation and analysis processes. Finally it produces
natural gestures for embodied communicative agents. The performer wears black clothing against a dark background to facilitate the hand and face detection tasks.
Davis and Shah [4] have developed a method for recognizing hand gestures applying a model-based approach.
Here, a finite state machine is employed to model four qualitatively distinct phases of a generic gesture. Binary marked
gloves are exploited to track the fingertips. Gestures are broken into postures, represented as lists of vectors, and then matched against stored vectors using table lookup.
Invariant moments have been widely used for gesture/posture detection. Ng et al. [11] have proposed a
system for automatic detection and recognition of human
head gestures/postures. It combines invariant moments and
hidden Markov model (HMM) for feature extraction and
recognition tasks, respectively. The main advantage of this approach is that it can operate against a relatively complex background. However, the computational requirements arising from the invariant moment extraction and the HMM's application render the approach inappropriate for real-time applications where several gestures/postures are involved. As a result, the system can only recognize "YES", "NO", and "PO" head gestures.
In some circumstances it is necessary to ignore motion
path analysis of the gestures for fast processing. This kind
of analysis is referred to as posture analysis. In this paper we propose a new methodology for selecting a set of appropriate hand postures for applications aiming at a visual-based interface. The goal is to find simple but robust postures that can be easily recognized and have distinguishing features. This study addresses two aspects of posture recognition for human-machine interface: first, which postures are more recognizable, and second, how to extract features that meet both the recognition-power and speed requirements of such applications. Towards these goals, we have developed a novel methodology based on recognition rates and introduce two matrices: cluster-property and cluster-features. The former is a structure for single-valued properties of the postures, while the latter is for multiple-valued feature vectors describing posture images.
The rest of the paper is organized as follows: the next section explains our approach in detail; Section 3 presents experimental results; and Section 4 concludes the paper and poses some new research directions.
2. Hand Posture Analysis
One of the most important aspects of HMI in virtual reality, telemedicine, and computer games, where the user communicates with the program's engine through hand gestures/postures, is the reasonable selection (or design) of appropriate gestures/postures. This section presents a general
scheme on how to assess several possibilities. To explain
the proposed scheme we utilize a collection of 2080 hand
postures [2, 12], and show how the approach works on this
collection. The procedure can be adopted for other collections without any need to change its general structure.

Figure 1. International sign language hand alphabet [2]
Initially, the collection is grouped into a 25-posture hand alphabet. The images are 255-level gray-scale, generated by a hand in a black sleeve against a dark background. Figure 1 shows representative postures and Figure 2 depicts some examples of the images. Due to the varying lighting conditions of the images within the database, using a single threshold to binarize the images is inadequate. Figure 3 shows instances where a single threshold causes inappropriate segmentation of the hand shape. For this reason, K-means clustering is employed for binarization in the pre-processing stage. This successfully segments hand postures from the background (see Figure 3).
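To make this step concrete, a minimal sketch of K-means binarization over pixel intensities (K = 2) is given below. This is an illustrative NumPy implementation, not the code used in our experiments; the function name, initialization, and iteration count are our own choices.

import numpy as np

def kmeans_binarize(img, iters=20):
    """Segment a gray-scale image into hand/background with 1D K-means (K = 2)."""
    pixels = img.astype(np.float64).ravel()
    # Start the two cluster centers at the extreme intensities.
    centers = np.array([pixels.min(), pixels.max()])
    for _ in range(iters):
        # Assign each pixel to its nearest center.
        labels = np.abs(pixels[:, None] - centers[None, :]).argmin(axis=1)
        # Move each center to the mean of its assigned pixels.
        for k in range(2):
            if np.any(labels == k):
                centers[k] = pixels[labels == k].mean()
    # Take the brighter cluster as the hand (skin against a dark sleeve/background).
    return (labels == centers.argmax()).reshape(img.shape)

Unlike a fixed threshold, the decision boundary here adapts to each image's intensity distribution, which is why it copes with the varying lighting shown in Figure 3.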
Size normalization using nearest-neighbor interpolation is applied next. This is to achieve the scale invariance property, which allows postures of different sizes to have similar features. The bounding box of the region of interest is found first and then normalized to w × h pixels (64 × 64 pixels in our experiments).
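A minimal sketch of this normalization, assuming a binary mask produced by the previous step (the function name and layout are illustrative):

import numpy as np

def normalize_posture(mask, w=64, h=64):
    """Crop the hand mask to its bounding box and resize to w x h pixels."""
    ys, xs = np.nonzero(mask)
    crop = mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    # Nearest-neighbor interpolation: map every target pixel to a source pixel.
    rows = np.arange(h) * crop.shape[0] // h
    cols = np.arange(w) * crop.shape[1] // w
    return crop[np.ix_(rows, cols)]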
Next, for each segmented and normalized posture g belonging to a posture group Gi, i = 1, ..., I, we extract J shape properties Pj, j = 1, ..., J. Currently, for the hand collection, I is 25 and J is chosen to be 14, corresponding to the 25 posture clusters and 14 predominant posture properties, respectively. The properties comprise seven geometric and seven invariant moment-based functions. The geometric properties are: area (ar), perimeter (pr), major axis length (mj), minor axis length (mi), eccentricity (ec), and the ratios ar/pr and mj/mi. The invariant moment-based functions have been widely used in a number of applications [7, 13, 10]. The first six functions (φ1-φ6) are invariant under rotation, and the last one, φ7, is both skew and rotation invariant. They are based on the central (i, j)-th moments µij of a 2D image f(x, y), which are defined as follows:
\mu_{ij} = \sum_x \sum_y (x - \bar{x})^i (y - \bar{y})^j f(x, y)    (1)
Figure 2. Hand posture samples

Then, the invariant moment-based functions are defined as

\phi_1 = \eta_{20} + \eta_{02}
\phi_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2
\phi_3 = (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2
\phi_4 = (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2
\phi_5 = (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]
\phi_6 = (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03})
\phi_7 = (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]    (2)
where \eta_{ij} = \mu_{ij} / \mu_{00}^{\gamma} and \gamma = (i + j)/2 + 1.
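For illustration, Eqs. 1 and 2 translate directly into the short NumPy routine below; it follows the standard invariant-moment formulas, and the helper names are our own.

import numpy as np

def central_moment(f, i, j):
    """mu_ij of Eq. 1 for a 2D image f(x, y)."""
    y, x = np.mgrid[:f.shape[0], :f.shape[1]]
    m00 = f.sum()
    xbar, ybar = (x * f).sum() / m00, (y * f).sum() / m00
    return ((x - xbar) ** i * (y - ybar) ** j * f).sum()

def invariant_moments(f):
    """The seven functions phi_1..phi_7 of Eq. 2."""
    f = f.astype(np.float64)
    mu00 = f.sum()
    def eta(i, j):  # eta_ij = mu_ij / mu00^gamma, gamma = (i + j)/2 + 1
        return central_moment(f, i, j) / mu00 ** ((i + j) / 2 + 1)
    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    phi1 = n20 + n02
    phi2 = (n20 - n02) ** 2 + 4 * n11 ** 2
    phi3 = (n30 - 3 * n12) ** 2 + (3 * n21 - n03) ** 2
    phi4 = (n30 + n12) ** 2 + (n21 + n03) ** 2
    phi5 = ((n30 - 3 * n12) * (n30 + n12)
            * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
            + (3 * n21 - n03) * (n21 + n03)
            * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2))
    phi6 = ((n20 - n02) * ((n30 + n12) ** 2 - (n21 + n03) ** 2)
            + 4 * n11 * (n30 + n12) * (n21 + n03))
    phi7 = ((3 * n21 - n03) * (n30 + n12)
            * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
            - (n30 - 3 * n12) * (n21 + n03)
            * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2))
    return [phi1, phi2, phi3, phi4, phi5, phi6, phi7]

The geometric properties (area, perimeter, axis lengths, eccentricity) can likewise be read off the binary mask, e.g. with skimage.measure.regionprops.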
To determine the recognition power of each cluster Gi, we exploit a classification scheme using the properties Pj. Initially, we try to classify 500 randomly selected postures (20 from each group) into their associated groups. Recognition rates Rij for i = 1, ..., I and j = 1, ..., J are obtained and saved in the appropriate entries of a cluster-property matrix. The classification is based on the Bayes rule, assuming a Gaussian distribution for the hand posture patterns [1, 2]. To derive a decision function for our classifier, we consider J one-dimensional probability density functions. Each function involves I pattern groups governed by Gaussian densities with means mij and standard deviations σij. Therefore, the Bayes decision function has the following form [5]:
d_{ij}(g) = p(g/G_i) P(G_i)    (3)
which is equivalent to

d_{ij}(g) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \, e^{-\frac{(g - m_{ij})^2}{2\sigma_{ij}^2}} \, P(G_i)    (4)

for i = 1, ..., I and j = 1, ..., J, where p(g/Gi) is the probability density function of the posture pattern g from cluster Gi and P(Gi) is the probability of occurrence of the corresponding cluster.

Figure 3. Instances where lower thresholds produce many unwanted noisy regions (upper two images) and higher thresholds destroy the hand region (middle two images), while K-means clustering segments the hand region properly (lower two images)
Assuming equally likely occurrence of all classes (i.e., P(G_1) = P(G_2) = ... = P(G_I) = 1/I), and given the exponential form of the Gaussian density, which suggests the use of the natural logarithm, and since the logarithm is a monotonically increasing function, the decision function in Eq. 4 can be converted into a more convenient form. Based on these assumptions and facts, we can use the following decision function, which is less computationally expensive and much faster for the classification of hand postures:
d_{ij}(g) = \ln\left[ p(g/G_i) P(G_i) \right] = \ln p(g/G_i) + \ln P(G_i)    (5)

Considering Eq. 4, this can be written as

d_{ij}(g) = -\tfrac{1}{2}\ln 2\pi - \ln \sigma_{ij} - \frac{(g - m_{ij})^2}{2\sigma_{ij}^2} + \ln P(G_i)    (6)

Dropping the constant terms -\tfrac{1}{2}\ln 2\pi and \ln P(G_i), which have no effect on the numerical order of the decision function, an expeditious decision function is obtained as

d_{ij}(g) = -\ln \sigma_{ij} - \frac{(g - m_{ij})^2}{2\sigma_{ij}^2}    (7)

for i = 1, ..., I and j = 1, ..., J, where mij and σij are the mean and standard deviation of posture group Gi using property Pj, and g is the corresponding scalar property of an unknown posture.
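A sketch of this single-valued classifier, under the stated assumptions (equal priors, one Gaussian per cluster); the data layout and function names are illustrative:

import numpy as np

def fit_property(train_values):
    """train_values: dict cluster_id -> 1D array of property P_j over training postures."""
    return {i: (v.mean(), v.std()) for i, v in train_values.items()}

def classify_scalar(g, params):
    """Assign scalar property g to the cluster maximizing Eq. 7."""
    scores = {i: -np.log(s) - (g - m) ** 2 / (2 * s ** 2)
              for i, (m, s) in params.items()}
    return max(scores, key=scores.get)

Because Eq. 7 needs only a logarithm, a subtraction, and a division per cluster, the on-line cost is negligible compared with evaluating Eq. 4 directly.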
Utilizing the above classification approach, we calculate recognition rates Rij for each single-valued property Pj and each posture group Gi and save them at the intersections of the corresponding rows and columns of the cluster-property matrix.
Next, to carry out a combinatory analysis and derive an efficient feature vector for posture recognition, a set of K = 18 different combinations of the geometric properties and invariant moment-based functions is generated and the corresponding recognition rates are obtained. Here, since the properties are multiple-valued, the decision function for the classification is derived differently. In the multiple-valued case, the Gaussian density of the vectors in the i-th posture class has the form
p(\xi/G_i) = \frac{1}{(2\pi)^{n/2} |C_{ik}|^{1/2}} \, e^{-\frac{1}{2} (\xi - m_{ik})^T C_{ik}^{-1} (\xi - m_{ik})}    (8)

for k = 1, 2, ..., K, where ξ is the extracted feature vector of an unknown posture, n is the dimensionality of the feature vectors, and | · | indicates the matrix determinant. Note that each density is specified completely by its mean vector mik and covariance matrix Cik, which are defined as

m_{ik} = E_{ik}\{\xi\}    (9)

and

C_{ik} = E_{ik}\{(\xi - m_{ik})(\xi - m_{ik})^T\}    (10)

where Eik{·} denotes the expected value of the argument over the postures of class Gi using multiple-valued property Pk. Approximating the expected value Eik by the average of the quantities in question yields estimates of the mean vector and covariance matrix:

m_{ik} = \frac{1}{N_i} \sum_{\xi \in G_i} \xi    (11)

and

C_{ik} = \frac{1}{N_i} \sum_{\xi \in G_i} (\xi \xi^T - m_{ik} m_{ik}^T)    (12)

where Ni is the number of posture vectors from class Gi and the summation is taken over those vectors, for k = 1, 2, ..., K.
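As a small illustration, Eqs. 11 and 12 amount to the following estimates (X is assumed to be an N_i x n array of training feature vectors from cluster G_i):

import numpy as np

def estimate_params(X):
    """Sample mean vector (Eq. 11) and covariance matrix (Eq. 12)."""
    m = X.mean(axis=0)                      # Eq. 11
    C = X.T @ X / len(X) - np.outer(m, m)   # Eq. 12: (1/N_i) sum xi xi^T - m m^T
    return m, C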
To obtain a simple decision function for the multiple-valued case, and considering that the logarithm preserves the numerical order of its argument, substituting Eq. 8 into d_{ik}(\xi) = \ln[p(\xi/G_i)P(G_i)] yields

d_{ik}(\xi) = -\frac{n}{2}\ln 2\pi - \frac{1}{2}\ln|C_{ik}| - \frac{1}{2}(\xi - m_{ik})^T C_{ik}^{-1} (\xi - m_{ik}) + \ln P(G_i)    (13)
Once again, the term -\frac{n}{2}\ln 2\pi is the same for all cases, and if all classes are equally likely to occur then P(G_i) = 1/I for i = 1, 2, ..., I, which is a constant and has no effect on the numerical order of the decision function. Hence, a simple and expeditious decision function is obtained as
d_{ik}(\xi) = -\ln|C_{ik}| - (\xi - m_{ik})^T C_{ik}^{-1} (\xi - m_{ik})    (14)
for i = 1, ..., I and k = 1, ..., K. Note that the Cik values are independent of the input ξ, which means they can be calculated off-line and saved in a look-up table. They are fetched from the look-up table at the on-line stage to accelerate the decision-making process.
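A sketch of this off-line/on-line split, assuming the per-cluster parameters of Eqs. 11-12 have already been estimated (function names are our own):

import numpy as np

def build_lookup(params):
    """Off-line: params is a dict cluster_id -> (mean vector, covariance matrix)."""
    table = {}
    for i, (m, C) in params.items():
        _, logdet = np.linalg.slogdet(C)   # C is positive definite, so the sign is +1
        table[i] = (m, np.linalg.inv(C), logdet)
    return table

def classify_vector(xi, table):
    """On-line: assign feature vector xi to the cluster maximizing Eq. 14."""
    best, best_score = None, -np.inf
    for i, (m, Cinv, logdet) in table.items():
        d = xi - m
        score = -logdet - d @ Cinv @ d
        if score > best_score:
            best, best_score = i, score
    return best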
The diagonal element crr of Cik is the variance of the r-th element of the posture vector, and the off-diagonal element crs is the covariance of xr and xs. When the elements xr and xs of the feature vector are statistically independent, crs = 0. This property has been used to identify independent features and to select them for the feature combinations in the multiple-valued properties. Notably, when the off-diagonal elements of the covariance matrix Cik are zero, the multivariate Gaussian density function reduces to the product of the univariate densities of the elements of the ξ vector. This, in turn, expedites the generation of the look-up table.
The recognition rates Rik for i = 1, ..., I and k = 1, ..., K are calculated utilizing Eq. 14 and saved in the appropriate entries of another structure called the cluster-features matrix. This represents not only the distinguishability of the isolated hand postures but also the recognition power of the different feature sets used to describe postures.
The general paradigm explained above provides a straightforward method for selecting distinguishable postures and is shown to be effective in the experimental results (next section). More importantly, column-wise summations in the cluster-property and cluster-features matrices indicate the recognition power of the simple properties and complex features, respectively. Row-wise summations exhibit the discrimination power of each posture, which is an important clue for the selection of postures for the application in use.
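For illustration, both summaries are plain averages over the recognition-rate matrices (here R is a hypothetical I x J NumPy array):

import numpy as np

def rank_columns(R):
    """Rank properties/feature sets by recognition power (best first)."""
    return np.argsort(R.mean(axis=0))[::-1]

def rank_rows(R):
    """Rank postures by discrimination power (best first)."""
    return np.argsort(R.mean(axis=1))[::-1]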
3. Experimental Results
As stated before, a database of 2080 hand postures is used for the experiments. The database is publicly available [12]. There are 25 sets of postures, each having between 40 and 100 members. In the training stage the statistical model parameters are obtained. These include the means and standard deviations (scalars) for individual properties, and the means (vectors) and covariance matrices for combined features. In the recognition stage, 500 randomly selected postures (20 in each of the 25 groups) from the database were classified using the approach explained in Section 2.
For each test posture the singular properties and the feature vectors are obtained. These are used to evaluate a specific posture based on its geometric properties and feature sets, respectively. The recognition rate in each entry of the cluster-property matrix is the number of correctly classified postures divided by the number of inputs. For example, if 12 of the 20 input postures in cluster G10 are correctly classified into the same cluster by the decision function given in Eq. 7 using the perimeter property, then the recognition rate in row G10, column pr of the cluster-property matrix is 12/20 = 60%. In this part, 14 individual properties (7 geometric and 7 invariant moment-based functions) are examined for the 25 posture groups. To compare the recognition power of the different properties, an overall recognition rate is obtained for each column of the matrix by simply averaging the recognition rates in that column. The overall results show that the three best singular properties are mj, mi, and ar/pr. The five most distinguishable postures, found by row-wise averaging of the recognition rates in the cluster-property matrix, are depicted in Figure 4.
Next, we tried to classify the test postures using the 18 combinatory feature sets. The recognition rates are obtained using the decision function in Eq. 14 and the results are saved in the cluster-features matrix, which in our current experiments has 18 columns.

Figure 4. The five most distinguishable postures, in row-wise order, based on the data in the cluster-property matrix

The rows correspond to hand posture clusters and the columns correspond to various combinations of features (feature vectors). The number of entries in the feature vectors varies from two to seven. There is a massive number of possible combinations, but we chose only those properties that had previously shown better discriminating power. These properties were tentatively chosen based on their independence, as determined using the covariance matrices. The cluster-property and cluster-features matrices are relatively large, and space limitations preclude presenting them here.
The invariant moment-based functions showed a lack of efficacy, while different combinations of geometric properties exhibited higher recognition rates. An overall recognition rate of 98.8% is obtained using the five-entry feature vector {mj, mi, ec, ar, pr}.
4. Conclusion and Further Work
We proposed a novel paradigm for selecting efficient hand postures using cluster-property and cluster-features matrices. The former includes recognition rates for different postures using singular properties, while the latter deals with multiple-valued features. The recognition rates are obtained utilizing two simplified decision functions. The proposed approach can be used in telemedicine, virtual reality, video games, and sign language applications aiming at a visual-based interface. Moreover, we have examined several features for discriminating hand postures in a simple, fast, and robust way, as required in real-time applications. The results explicitly show the discrimination rank of individual hand postures, which can be used to reasonably select appropriate postures in different applications. Furthermore, combinations of features have been examined, and a small feature vector containing only five simple features yields an overall recognition rate of 98.8%.
The proposed approach can be applied to other postures, including limb, head, and whole-body postures. Shape features extracted from the posture image can easily be evaluated for efficacy using the proposed scheme. Moreover, we intend to employ the proposed approach in immersive distributed environments, where several users of a distributed system communicate through their hand or body gestures/postures. For further improvements, objective criteria for user satisfaction can be defined and a time-based comparison can be carried out.
Acknowledgments.
This work is supported by the
Smart Internet Technology Cooperative Research Centre
(SITCRC), Australia.
References
[1] H. Birk and T. B. Moeslund. Recognizing gestures from the
hand alphabet using principal component analysis. Master’s
thesis, Laboratory of Image Analysis, Aalborg University,
1996.
[2] H. Birk, T. B. Moeslund, and C. B. Madsen. Real-time
recognition of hand gestures using principal component
analysis. In Proc. 10th Scandinavian Conf. on Image Analysis (SCIA’97), 1997.
[3] C.-C. Chang and W.-H. Tsai. Vision-based tracking and interpretation of human leg movement for virtual reality applications. IEEE Trans. Circuits and Systems for Video Technology, 11(1):9–24, 2001.
[4] J. Davis and M. Shah. Visual gesture recognition. IEE
Proc. Vision, Image and Signal Processing, 141(2):101–106,
1994.
[5] R. C. Gonzalez and R. E. Woods. Digital Image Processing.
Addison-Wesley, 1992.
[6] ISO/IEC JTC 1/SC 29/WG 11 N 2502. Information
technology-generic coding of audio-visual objects-part 2:
visual. Technical report, ISO/IEC, Atlantic City, Oct. 1998.
[7] A. J. Jain and A. Vailaya. Shape-based retrieval: a case study
with trademark image databases. Patt. Recog., 31(9):1369–
1390, 1998.
[8] Z. Jian, M. N. Kaynak, A. D. Cheok, and K. C. Chung. Realtime lip tracking for virtual lip implementation in virtual environments and computer games. In Proc. IEEE Int. Conf.
Fuzzy Systems, volume 3, pages 1359–1362, 2001.
[9] H. Kang, C. W. Lee, and K. Jung. Recognition-based gesture spotting in video games. Pattern Recognition Letters,
25(15):1701–1714, 2004.
[10] D. Mohamad, G. Sulong, and S. S. Ipson. Trademark matching using invariant moments. In Proc. Second Asian Conf. Computer Vision (ACCV'95), volume 1, pages 439-444, Singapore, 1995.
[11] P. C. Ng and L. C. D. Silva. Head gestures recognition. In Proc. IEEE Int. Conf. Image Processing (ICIP), volume 3, pages 266-269, 2001.
[12] Thomas Moeslund's Gesture Recognition Database. http://www.vision.auc.dk/%7etbm/gestures/database.html/. Visited on 10/2/2005.
[13] S. J. Yoon, D. K. Park, S. Park, and C. S. Won. Image retrieval using a novel relevance feedback for edge histogram descriptor of MPEG-7. In Proc. IEEE Int. Conf. Consumer Electronics, pages 354-355, Piscataway, NJ, USA, 2001.
[14] L. Zhao and N. I. Badler. Acquiring and validating motion qualities from live limb gestures. Graphical Models, 67(1):1-16, 2005.