1. Introduction
Human-computer interaction (HCI) is a hot topic in the information technology age.
Operators provide commands and obtain results through an interactive computing system
that involves one or more interfaces. The development of graphical operator interfaces has
allowed operators with varying levels of computer skills to use a wide variety of software
applications. Recent advancements in HCI provide more intuitive and natural ways to
interact with computing devices.
There are many HCI modalities, including facial expression, body posture, hand gesture,
speech recognition and so on. Among them, hand gesture is an intuitive and easy-to-learn
means of interaction. Wang et al. [1] proposed an automatic system that performs hand
gesture spotting and recognition simultaneously based on Hidden Markov Models (HMM).
Suk et al. [2] proposed a new method for recognizing hand gestures in a continuous video
stream using a dynamic Bayesian network (DBN) model. Van den Bergh et al. [3] introduced
a real-time hand gesture interaction system based on a Time-of-Flight camera. Due to the
diversity of hand gestures, recognition is subject to some limitations. By comparison, the
pointing gesture, the simplest hand gesture and the easiest to recognize, has attracted much
attention. Park and Lee [4] presented a 3D pointing gesture recognition algorithm based on a
cascade hidden Markov model (HMM) and a particle filter for mobile robots. Kehl and Gool
[5] proposed a multi-view method to estimate the 3D directions of one or both arms. Michael
[6] set up two orthogonal cameras, one with a top view and one with a side view, to detect
hand regions, track finger pointing features, and estimate pointing directions in 3D space.
Most previous pointing gesture recognition methods use the results of face and hand tracking
[7] to recognize the
pointing direction. However, the recognition rate is limited by unreliable face and hand
detection and tracking in 3D space. Another difficult problem is recognizing small pointing
gestures, which often result in a wrong estimated direction.
Body tracking or skeleton tracking techniques using an ordinary camera are not easy to
implement and require extensive development time. Surveys of body and skeleton tracking
show that detecting the bone joints of the human body is still a major problem, since the
depth of the human cannot be determined with a typical camera. Some researchers have tried
to use more than one video camera to detect and determine the depth of the object, but the
consequence is that the cost increases and processing slows down due to the increased
amount of data. Fortunately, the Kinect sensor makes it possible to acquire the depth and
skeleton of operators. The Kinect has three optical components: an infrared emitter and an
infrared camera optimized for depth detection, and one standard visual-spectrum camera used
for visual recognition.
Taking recognition robustness and camera cost into consideration, some researchers have
used the Kinect sensor instead, which provides a higher resolution at a lower cost. The
Microsoft Kinect sensor combines depth and RGB cameras in one compact device [8, 9]. It is
robust to all colors of clothing and to background noise, and provides an easy way to achieve
real-time operator interaction [10-14]. There are some freely available libraries that can be
used to collect and process data from a Kinect sensor, including skeletal tracking [15, 16].
This paper makes two main contributions. Firstly, a new method to detect the pointing
fingertip is proposed, which is used to recognize pointing gestures and interact with a large
screen instead of relying on the head-hand line. Secondly, a scalable and flexible virtual
touch screen is constructed, which is adaptively adjusted as the operator moves and as the
pointing arm extends or contracts. Experimental comparisons show that the developed
method recognizes pointing gestures and realizes HCI robustly and efficiently.
The rest of the paper is organized as follows. The pointing gesture recognition method is
described in Section 2. How to construct a virtual touch screen is introduced in Section 3.
Experimental results are given in Section 4, and conclusions are drawn in Section 5.
H_p = \begin{cases} H_r, & z_r < z_l \\ H_l, & \text{otherwise} \end{cases}    (1)
where H_p is the pointing hand, H_r refers to the right hand, H_l refers to the left hand, and
z_r and z_l denote the z coordinates of the right and left hands, respectively.
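As a minimal illustration of (1), the pointing hand can be chosen by comparing the z coordinates of the two tracked hand joints; the tuple-based joint representation and the function name below are assumptions for the sketch, not part of any Kinect API.

def select_pointing_hand(right_hand, left_hand):
    """Return the pointing hand according to (1): the hand with the smaller z value.

    right_hand, left_hand: (x, y, z) skeleton joint positions of the two hands,
    assumed to be given in the Kinect skeleton coordinate frame (meters).
    """
    return right_hand if right_hand[2] < left_hand[2] else left_hand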
In order to capture the hand motion used for controlling the large screen, it is necessary to
separate the hand from the depth image. A depth image records the depth of all the pixels of
the RGB image. The depth value of the hand joint can be obtained through skeleton-to-depth
conversion, and it is taken as a segmentation threshold used to extract the hand region
H_d(i, j) from the raw depth image.
H_d(i, j) = \begin{cases} d(i, j), & |d(i, j) - T_d| < e \\ 0, & \text{otherwise} \end{cases}, \quad 1 \le i \le m, \ 1 \le j \le n    (2)
where d(i, j) is a pixel of the depth image D, T_d is the depth of the pointing hand's joint, e
is a small range around the threshold T_d, m is the width of D, and n is the height of D.
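A minimal sketch of this segmentation, assuming the depth frame is available as a NumPy array in millimeters and that the pointing hand joint's depth T_d has already been obtained by skeleton-to-depth conversion; the array layout and the tolerance value e are assumptions:

import numpy as np

def segment_hand(depth_frame, hand_depth_mm, tol_mm=100.0):
    """Keep only pixels whose depth lies within +/- tol_mm of the hand joint's depth, as in (2).

    depth_frame: H x W array of depth values in millimeters (0 = no reading).
    hand_depth_mm: depth T_d of the pointing hand joint.
    Returns an array of the same shape with non-hand pixels set to 0.
    """
    mask = (depth_frame > 0) & (np.abs(depth_frame - hand_depth_mm) < tol_mm)
    return np.where(mask, depth_frame, 0)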
2.2. Pointing Fingertip Detection
Most existing methods for pointing gesture recognition use the operator's hand-arm
motion to determine the pointing direction. To study hand gestures further, several
approaches for fingertip detection have been proposed. In this paper, the pointing fingertip
[17-19], which gives a more precise position, is detected instead of the hand to interact with
the large screen. The hand's minimum bounding rectangle makes it easy to extract the
pointing fingertip. The rules for pointing fingertip detection, in particular index fingertip
detection, are as follows:
(1) Extract the tracked hand region using the minimum bounding rectangle obtained from
the skeletal information, together with the corresponding elbow joint.
(2) When the operator's hand is pointing to the left, the pointing hand joint's coordinate in
the horizontal direction (x coordinate) is less than the corresponding elbow's.
In this case, if the hand joint's coordinate in the vertical direction (y coordinate) is also
less than the elbow's and the width of the minimum bounding rectangle is less than its height,
the index fingertip moves along the bottom edge of the bounding rectangle.
If the pointing hand joint's y coordinate is larger than the corresponding elbow's and the
width of the minimum bounding rectangle is less than its height, the index fingertip moves
along the top edge of the bounding rectangle. If the rectangle's width is larger than its height,
the index fingertip moves along the left edge of the bounding rectangle.
(3) When the operator's hand is pointing to the right, the pointing hand joint's x coordinate
is larger than that of the corresponding elbow, and the distinguishing rules are the same as in
step (2).
The distinguishing rules above locate the pointing fingertip in the depth image, and its 3D
coordinates can then be obtained through depth-to-skeleton conversion.
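The edge-selection rules of Section 2.2 can be sketched as follows. This is a simplified illustration only: the coordinate convention (the bottom/top edge rule is taken literally as stated above), the right-edge case for right-pointing gestures, and the data structures are assumptions rather than the paper's implementation.

import numpy as np

def locate_index_fingertip(hand_mask, box, hand_xy, elbow_xy):
    """Pick the index fingertip pixel along one edge of the hand's minimum bounding rectangle.

    hand_mask: binary image of the segmented hand region (see (2)).
    box: (left, top, width, height) of the minimum bounding rectangle.
    hand_xy, elbow_xy: (x, y) positions of the hand and elbow joints in the same image.
    """
    left, top, w, h = box
    pointing_left = hand_xy[0] < elbow_xy[0]

    if w >= h:
        # Arm roughly horizontal inside the box: fingertip lies on the edge
        # facing the pointing direction (left edge when pointing left).
        col = left if pointing_left else left + w - 1
        rows = np.flatnonzero(hand_mask[top:top + h, col])
        return (col, top + int(rows.mean())) if rows.size else None

    # Rule as stated in Section 2.2: hand y less than elbow y -> bottom edge,
    # otherwise top edge.
    row = top + h - 1 if hand_xy[1] < elbow_xy[1] else top
    cols = np.flatnonzero(hand_mask[row, left:left + w])
    if cols.size == 0:
        return None
    col = left + (cols.min() if pointing_left else cols.max())
    return (col, row)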
2.3. Pointing Fingertip Tracking
In some cases, the fingertip is detected mistakenly due to interference from the
surrounding environment. To avoid false detections, a tracking technique is introduced to
track the feature points. A Kalman filter [20-22] is used to model the detected fingertip's
motion trajectory, and the features of pointing gestures (including the fingertip's position and
speed) can be extracted through this tracking method.
\mathbf{x} = [x, y, z, v_x, v_y, v_z]^T    (3)
where x, y and z denote the image coordinates of the detected fingertip, and v_x, v_y and v_z
denote its displacement components.
The Kalman filter model assumes that the true state at time k evolves from the state at time
(k-1) according to (4):
x(k) = \Phi(k) x(k-1) + W(k)    (4)
where \Phi(k) is the state transition model applied to the previous state x(k-1), and W(k) is
the process noise, assumed to be drawn from a white Gaussian noise process with covariance
Q(k).
At time k, an observation (or measurement) z(k) of the true state x(k) is made according to
(5):
z(k) = H(k) x(k) + V(k)    (5)
where H(k) is the observation model, which maps the true state space into the observed
space, and V(k) is the observation noise, assumed to be zero-mean Gaussian white noise with
covariance R(k).
In our method, the Kalman filter is initialized with six states and three measurements; the
measurements correspond directly to x, y, z in the state vector.
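A minimal constant-velocity Kalman filter with this six-state, three-measurement structure could be organized as in the sketch below; the time step, noise covariances and class interface are illustrative assumptions, not the parameter values used in the paper.

import numpy as np

class FingertipKalman:
    """Constant-velocity Kalman filter: state [x, y, z, vx, vy, vz], measurement [x, y, z]."""

    def __init__(self, dt=1.0, q=1e-3, r=1e-2):
        self.F = np.eye(6)                                  # state transition Phi(k)
        self.F[:3, 3:] = dt * np.eye(3)                     # position += velocity * dt
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])   # observe position only
        self.Q = q * np.eye(6)                              # process noise covariance Q(k)
        self.R = r * np.eye(3)                              # measurement noise covariance R(k)
        self.x = np.zeros(6)                                # state estimate
        self.P = np.eye(6)                                  # estimate covariance

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:3]

    def update(self, z):
        z = np.asarray(z, dtype=float)                      # measured fingertip position
        y = z - self.H @ self.x                             # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)            # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P
        return self.x[:3]

On each frame, predict() is called first and update() is called with the newly detected fingertip position; the filtered state smooths out occasional false detections.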
2.4. Pointing Gestures Recognition
Human motion is a continuous sequence of actions, containing gestures and non-gestures
without clear-cut boundaries. Gesture recognition refers to detecting and extracting
meaningful gestures from an input video. How to recognize a pointing gesture is crucial,
since the recognition procedure is only performed for detected pointing gestures. When a
person makes a pointing gesture, the whole motion can be separated into three phases:
non-gesture (the hands and arms drop naturally, and there is no need to find the fingertip),
move-hand (the pointing hand is moving and its direction is changing), and point-to (the
hand is approximately stationary). Among the three phases, only the point-to phase, i.e., the
actual pointing gesture, is relevant to target selection.
To recognize pointing gestures, the three phases must be distinguished. When the operator
moves his or her hand, the velocity at time t is estimated by (6):
v_x = x_t - x_{t-1}, \quad v_y = y_t - y_{t-1}, \quad v_z = z_t - z_{t-1}    (6)
n = \begin{cases} n + 1, & (|v_x| < \varepsilon) \,\&\, (|v_y| < \varepsilon) \,\&\, (|v_z| < \varepsilon) \\ 0, & \text{otherwise} \end{cases}    (7)
where \varepsilon is a small velocity threshold and & is the logical AND operation. When n
reaches a certain count, it indicates that the operator is pointing at the interaction target.
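The phase logic of (6) and (7) amounts to counting consecutive near-stationary frames of the tracked fingertip; the threshold values and the generator interface in the sketch below are assumptions chosen for illustration.

def detect_point_to_phase(positions, eps=4.0, min_frames=15):
    """Yield True for frames in which the fingertip has stayed approximately
    stationary for at least min_frames consecutive frames (the point-to phase).

    positions: iterable of (x, y, z) fingertip coordinates, e.g. in millimeters.
    eps: per-axis velocity threshold (the 4 mm value is an assumed choice).
    """
    prev, still = None, 0
    for x, y, z in positions:
        if prev is not None:
            vx, vy, vz = x - prev[0], y - prev[1], z - prev[2]
            still = still + 1 if (abs(vx) < eps and abs(vy) < eps and abs(vz) < eps) else 0
        prev = (x, y, z)
        yield still >= min_frames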
[Figure: large screen display model and virtual touch screen. Grid points (x_1, y_1), (x_2, y_1), ..., (x_m, y_n) mark the large screen display model, points (x_1', y_1'), ..., (x_m', y_1') mark the virtual touch screen, the projection center is the center of the operator's face, and the coordinate origin is (0, 0, 0).]
(8)
where the large screen lies in the plane z = 0, z_f equals the pointing fingertip's z coordinate,
and x_h, y_h and z_h refer to the 3D coordinates of the operator's head joint.
The corresponding coordinates on the virtual touch screen can be computed as in (9) and (10):
x_i' = \frac{(x_h - x_i)(z_f - z_h)}{z_h} + x_h    (9)
y_j' = \frac{(y_h - y_j)(z_f - z_h)}{z_h} + y_h    (10)
The virtual touch screen moves forward or backward with the movement of the pointing
fingertip and with the operator's motion. If the operator's pointing fingertip stays in one
block of the virtual screen for a while, the corresponding block on the large screen is
triggered.
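A sketch of this head-centered mapping and the block lookup used for triggering, under the projection of (9) and (10); the block representation and the helper names are assumptions for illustration.

def project_to_virtual_screen(px, py, head, z_f):
    """Project a large-screen grid point (px, py, z = 0) onto the virtual touch
    screen at depth z_f, using the head joint as the projection center, as in (9)-(10).

    head: (x_h, y_h, z_h) 3D coordinates of the operator's head joint.
    """
    x_h, y_h, z_h = head
    x_p = (x_h - px) * (z_f - z_h) / z_h + x_h
    y_p = (y_h - py) * (z_f - z_h) / z_h + y_h
    return x_p, y_p

def block_under_fingertip(fingertip_xy, blocks):
    """Return the index of the virtual-screen block containing the fingertip, or None.

    blocks: list of (x_min, y_min, x_max, y_max) rectangles on the virtual screen;
    a block is triggered when the fingertip dwells in it (e.g., using the counter of (7)).
    """
    fx, fy = fingertip_xy
    for i, (x0, y0, x1, y1) in enumerate(blocks):
        if x0 <= fx <= x1 and y0 <= fy <= y1:
            return i
    return None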
H_R = \{ R \mid J_H \in R \}    (11)
where R denotes the separable regions obtained by threshold segmentation and J_H is the
pointing hand joint.
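A sketch of this region selection, assuming the thresholded depth image from (2) is labeled into connected regions with SciPy; the use of scipy.ndimage and the (row, column) pixel convention are assumptions.

import numpy as np
from scipy import ndimage

def region_containing_hand(hand_depth_img, hand_joint_px):
    """Keep only the connected region that contains the pointing hand joint, as in (11).

    hand_depth_img: thresholded depth image (0 = background), see (2).
    hand_joint_px: (row, col) pixel position of the pointing hand joint J_H.
    """
    labels, _ = ndimage.label(hand_depth_img > 0)       # label the separable regions R
    hand_label = labels[hand_joint_px[0], hand_joint_px[1]]
    if hand_label == 0:
        return np.zeros_like(hand_depth_img)            # joint fell on background
    return np.where(labels == hand_label, hand_depth_img, 0)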
The first row of Figure 3 shows an RGB image and its corresponding depth image, and the
second row shows the detected hand region.
The pointing fingertip is extracted with the method described in Section 2.2, as shown in
Figure 4. The position of the pointing fingertip in 3D space can be obtained through
depth-to-skeleton conversion.
Figure 4. Pointing Fingertip Detection with Different Pointing Gestures. (a) The Right Hand
is Performing the Pointing Gesture. (b) The Left Hand is the Pointing Hand
As described in Section 2.4, (6) and (7) are used to determine whether the operator performs
a pointing gesture. When the operator is pointing at one lamp, his or her fingertip is
approximately stationary. According to some statistics, the hand trembling range is within 0
to 4 mm. In addition, a large number of experiments were carried out to choose a favorable n
for effective pointing gesture recognition.
R_p = \frac{M_r}{M_a}    (12)
where R_p denotes the pointing gesture recognition rate, M_r refers to the number of frames
in which a pointing gesture is recognized, and M_a refers to the actual number of pointing
frames.
R_c = \frac{N_c}{N_p} \times 100\%    (13)
where R_c refers to the correct target selection rate, N_p is the number of times the operator
points at one target, and N_c is the number of times the correct target is selected.
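The two evaluation measures can be computed directly from the frame and trial counts; the function names below are illustrative only.

def recognition_rate(recognized_frames, actual_pointing_frames):
    """R_p of (12): fraction of actual pointing frames recognized as pointing."""
    return recognized_frames / actual_pointing_frames

def correct_selection_rate(correct_selections, pointing_trials):
    """R_c of (13): percentage of pointing trials in which the right target was selected."""
    return 100.0 * correct_selections / pointing_trials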
During the experiments, we found that the pointing fingertip is sometimes detected
mistakenly, which directly affects the results of target selection. Therefore, as described in
Section 2.3, a Kalman filter is applied to track the pointing fingertip's motion trajectory and
reduce false detections.
Experimental results without and with the Kalman filter are given in Table 1 and Table 2,
respectively. A confusion matrix is used to summarize the results of target selection. Each
column of the matrix represents the recognized target, each row represents the actual target,
and the entries are the average rates of correct target selection. L1, L2, L3, L4, L5 and L6
refer to the six lamps fixed on the wall, as shown in Figure 6.
Table 1. Results of Target Selection without Kalman Filter

       L1       L2       L3       L4       L5       L6
L1     96.9%    2.2%     0        0        0        0
L2     3.1%     97.8%    0        0        0        0
L3     0        0        99.0%    0.4%     0        0
L4     0        0        1.0%     99.6%    0        0
L5     0        0        0        0        96.9%    2.2%
L6     0        0        0        0        3.1%     97.8%
Table 2. Results of Target Selection with Kalman Filter

       L1       L2       L3       L4       L5       L6
L1     98.4%    0.5%     0        0        0        0
L2     1.6%     99.5%    0        0        0        0
L3     0        0        99.6%    0        0        0
L4     0        0        0.4%     100%     0        0
L5     0        0        0        0        98.4%    0.5%
L6     0        0        0        0        1.6%     99.5%
To further evaluate the proposed method, another six volunteers performed HCI based on
pointing gestures to control the lamps on the wall. The performance of the proposed method
is compared with that of the methods developed by Yamamoto et al. [7] and Cheng [23]. The
comparison results are given in Table 3, which shows that the proposed method performs
best.
Table 3. Comparison of Correct Target Selection Rates for Different Methods

Method            Average Rc (%)
Method in [7]     90.1
Method in [23]    94.8
Proposal          99.2
5. Conclusions
A new 3D real-time method is developed for HCI based on pointing gesture recognition.
The proposed method involves the use of a Kinect sensor and a flexible virtual touch screen
for interacting with a large screen. The method is insensitive to illumination and surrounding
environmental changes because it uses the depth and skeletal maps generated by the Kinect
sensor. In addition, target selection depends only on the operator's pointing fingertip
position, so the method is suitable for both large and small pointing gestures as well as for
different operators.
Due to the wide application of speech recognition in HCI, our future work will focus on
combining speech and visual multimodal features to improve the recognition rate.
Acknowledgements
This work is supported in part by the Natural Science Foundation of China (Grant Nos.
11176016 and 60872117) and the Specialized Research Fund for the Doctoral Program of
Higher Education (Grant No. 20123108110014).
References
[1] M. Elmezain, A. Al-Hamadi and B. Michaelis, Hand trajectory-based gesture spotting and recognition using HMM, Proceedings of International Conference on Image Processing, (2009), pp. 3577-3580.
[2] H.-I. Suk, B.-K. Sin and S.-W. Lee, Hand gesture recognition based on dynamic Bayesian network framework, Pattern Recognition, vol. 9, no. 43, (2010), pp. 3059-3072.
[3] M. Van Den Bergh and L. Van Gool, Combining RGB and ToF cameras for real-time 3D hand gesture interaction, Proceedings of 2011 IEEE Workshop on Applications of Computer Vision, (2011), pp. 66-72.
[4] C.-B. Park and S.-W. Lee, Real-time 3D pointing gesture recognition for mobile robots with cascade HMM and particle filter, Image and Vision Computing, vol. 1, no. 29, (2011), pp. 51-63.
[5] R. Kehl and L. V. Gool, Real-time pointing gesture recognition for an immersive environment, Proceedings of 6th IEEE International Conference on Automatic Face and Gesture Recognition, (2004), pp. 577-582.
[6] J. R. Michael, C. Shaun and J. Y. Li, A multi-gesture interaction system using a 3-D iris disk model for gaze estimation and an active appearance model for 3-D hand pointing, IEEE Transactions on Multimedia, vol. 3, no. 13, (2011), pp. 474-486.
[7] Y. Yamamoto, I. Yoda and K. Sakaue, Arm-pointing gesture interface using surrounded stereo cameras system, Proceedings of International Conference on Pattern Recognition, vol. 4, (2004), pp. 965-970.
[8] M. Van Den Bergh, D. Carton and R. De Nijs, Real-time 3D hand gesture interaction with a robot for understanding directions from humans, Proceedings of IEEE International Workshop on Robot and Human Interactive Communication, (2011), pp. 357-362.
[9] Y. Li, Hand gesture recognition using Kinect, Proceedings of IEEE 3rd International Conference on Software Engineering and Service Science, (2012), pp. 196-199.
[10] M. Raj, S. H. Creem-Regehr, J. K. Stefanucci and W. B. Thompson, Kinect based 3D object manipulation on a desktop display, Proceedings of ACM Symposium on Applied Perception, (2012), pp. 99-102.
[11] A. Sanna, F. Lamberti, G. Paravati, E. A. Henao Ramirez and F. Manuri, A Kinect-based natural interface for quadrotor control, Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, 78 LNICST, (2012), pp. 48-56.
[12] N. Villaroman, D. Rowe and B. Swan, Teaching natural user interaction using OpenNI and the Microsoft Kinect sensor, Proceedings of ACM Special Interest Group for Information Technology Education Conference, (2011), pp. 227-231.
[13] S. Lang, M. Block and R. Rojas, Sign language recognition using Kinect, Lecture Notes in Computer Science, (2012), pp. 394-402.
[14] L. Cheng, Q. Sun, H. Su, Y. Cong and S. Zhao, Design and implementation of human-robot interactive demonstration system based on Kinect, Proceedings of 24th Chinese Control and Decision Conference, (2012), pp. 971-975.
[15] J. Ekelmann and B. Butka, Kinect controlled electro-mechanical skeleton, Proceedings of IEEE Southeast Conference, (2012), pp. 1-5.
[16] C. Sinthanayothin, N. Wongwaen and W. Bholsithi, Skeleton tracking using Kinect sensor and displaying in 3D virtual scene, International Journal of Advancements in Computing Technology, vol. 11, no. 4, (2012), pp. 213-223.
[17] G. L. Du, P. Zhang, J. H. Mai and Z. L. Li, Markerless Kinect-based hand tracking for robot teleoperation, International Journal of Advanced Robotic Systems, vol. 1, no. 9, (2012), pp. 1-10.
[18] J. L. Raheja, A. Chaudhary and K. Singal, Tracking of fingertips and centers of palm using Kinect, Proceedings of 3rd International Conference on Computational Intelligence, Modeling and Simulation, (2011), pp. 248-252.
[19] N. Miyata, K. Yamaguchi and Y. Maeda, Measuring and modeling active maximum fingertip forces of a
human index finger, Proceedings of IEEE International Conference on Intelligent Robots and Systems,
(2007), pp. 2156-2161.
[20] N. Li, L. Liu and D. Xu, Corner feature based object tracking using adaptive Kalman filter, Proceedings of
International Conference on Signal, (2008), pp. 1432-1435.
[21] M. S. Benning, M. McGuire and P. Driessen, Improved position tracking of a 3D gesture-based musical
controller using a Kalman filter, Proceedings of 7th International Conference on New Interfaces for Musical
Expression, (2007), pp. 334-337.
[22] X. Li, T. Zhang, X. Shen and J. Sun, Object tracking using an adaptive Kalman filter combined with mean
shift, Optical Engineering, vol. 2, no. 49, (2010), pp. 020503-1-020503-3.
[23] K. Cheng and M. Takatsuka, Estimating virtual touch screen for fingertip interaction with large displays,
Proceedings of ACM International Conference Proceeding Series, (2006), pp. 397-400.