Gesture Recognition With The Leap Motion Controller
2015
Robert McCartney
Jie Yuan
[email protected]
Hans-Peter Bischof
[email protected]
Recommended Citation:
McCartney, Robert; Yuan, Jie; and Bischof, Hans-Peter, "Gesture Recognition with the Leap Motion Controller" (2015). Accessed from http://scholarworks.rit.edu/other/857
3. Leap Motion Device

There are many motion sensing devices available in the marketplace. The Leap Motion controller was chosen for this project because of its accuracy and low price. Unlike the Kinect, which is a full body sensing device, the Leap Motion controller specifically captures the movements of a human hand, albeit using similar IR camera technology.

The Leap Motion controller is a very small (1.2 × 3 × 7.6 cm) USB device [4]. It tracks the position of objects in a space roughly the size of the top half of a 1 m beach ball through the reflection of IR light from LEDs. The API allows access to the raw data, which facilitates the implementation of gesture recognition algorithms. A summary of the specifications of the API: language support for Java, Python, JavaScript, Objective-C, C#, and C++; data is captured from the device at up to 215 frames per second; the precision of the sensor is up to 0.01 mm within a perception range of 1 cubic foot, giving it the ability to identify 7 × 10^9 unique points in its viewing area.

The SDK v2 introduced a skeletal model for the human hand. It supports queries such as the five finger positions in 3-dimensional space, open hand rotation, hand grabbing, pinch strength, and so on. The SDK also gives access to the raw data it sees. Here we use this device to implement and analyze different gesture recognition algorithms on a dataset collected through this API.
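To make the skeletal-model queries concrete, the following is a minimal sketch of polling per-frame hand features through the SDK v2 Python bindings. It assumes the Leap.py wrapper and its native libraries are importable; the attribute names (palm_position, palm_normal, grab_strength, and so on) follow the v2 API as documented, but the particular selection of features here is illustrative rather than the exact set used by our capture program.

import Leap  # SDK v2 Python bindings (Leap.py plus its native libraries)

def hand_features(frame):
    """Pull a few of the skeletal-model features exposed by SDK v2 from one frame."""
    rows = []
    for hand in frame.hands:
        palm = hand.palm_position
        rows.append({
            "palm_position": (palm.x, palm.y, palm.z),
            "palm_width": hand.palm_width,
            "pitch": hand.direction.pitch,
            "roll": hand.palm_normal.roll,
            "yaw": hand.direction.yaw,
            "grab_strength": hand.grab_strength,
            "pinch_strength": hand.pinch_strength,
            "fingertips": [(f.tip_position.x, f.tip_position.y, f.tip_position.z)
                           for f in hand.fingers],
        })
    return rows

if __name__ == "__main__":
    controller = Leap.Controller()
    # The controller connects asynchronously; a capture loop would poll
    # controller.frame() repeatedly while a gesture is being recorded.
    print(hand_features(controller.frame()))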
4. Previous Work
One commonly used method of recognition involves analyzing the path traced by a gesture as a time series of discrete observations and recognizing these time series with a hidden Markov model [5]. Typically, the discrete states are a set of unit vectors equally spaced in 2D or 3D, and the direction of movement of the recorded object between every two consecutive frames is matched to the closest of these state vectors, generating a sequence of discrete directions of movement for each gesture path [6], [7], [8] (a sketch of this discretization step appears at the end of this section). Hidden Markov models have also been used to develop online recognition systems, which record information continuously and determine the start and stop points of a gesture as data is collected in real time [8], [9].

Another class of methods for recognizing dynamic gestures involves the use of finite state machines to represent gestures [10], [11]. Each gesture can be represented as a series of states that correspond to regions in space where the recorded object may be located. The features of these states, such as their centroids and covariances, can be learned from training data using methods such as k-means clustering. When evaluating a new gesture, as the recorded object travels through the regions specified by these states, the resulting sequences of states are fed into finite state machines representing each of the trained gestures. In this way, gestures whose models are consistent with the input state sequences are identified.

Neural networks have typically been used to recognize static gestures, but recurrent neural networks have also been used to model gestures over time [12], [13]. One of the main advantages of this type of model is that multiple inputs from different sources can be fed into a single network, such as positions for different fingers as well as angles [13]. Additionally, convolutional neural networks and deep learning models have been used with great success to recognize offline handwriting characters [14], which can be considered analogous to hand gestures under certain representations, as shown in this paper. A similar problem domain to gesture recognition, although in a lower dimensional space, is that of handwritten text recognition, where long short-term memory networks are the current state of the art [15], [16], [17].
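The sketch below illustrates the direction-quantization step used by the HMM-based approaches described above: each frame-to-frame displacement of the tracked object is matched to the nearest vector in a small codebook of unit vectors, yielding a sequence of discrete observation symbols. This is an illustrative reimplementation of the general scheme in [6], [7], [8], not code from those systems, and the particular codebook (six axis-aligned directions plus the eight diagonals) and the jitter threshold are assumptions.

import numpy as np

def direction_codebook():
    """A small set of roughly equally spaced unit vectors in 3D
    (axis-aligned directions plus the eight cube diagonals)."""
    axes = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    diagonals = [(sx, sy, sz) for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]
    codes = np.array(axes + diagonals, dtype=float)
    return codes / np.linalg.norm(codes, axis=1, keepdims=True)

def quantize_path(path, min_step=1.0):
    """Convert an (N, 3) array of positions into a sequence of discrete
    direction symbols, one per frame-to-frame movement."""
    codes = direction_codebook()
    symbols = []
    for delta in np.diff(np.asarray(path, dtype=float), axis=0):
        norm = np.linalg.norm(delta)
        if norm < min_step:  # skip near-zero movements (assumed jitter threshold, in mm)
            continue
        # the closest codebook direction is the one with the largest cosine similarity
        symbols.append(int(np.argmax(codes @ (delta / norm))))
    return symbols  # observation sequence on which an HMM could be trained

The resulting symbol sequences are what the cited systems feed into per-gesture hidden Markov models for training and recognition.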
5. Dataset

Fig. 1: Leap Motion Visualizer

In order to examine various machine learning algorithms on gestures generated through the Leap Motion controller, we needed a dataset that captured some prototypical gestures. To this end, a simple GUI was created that gave users instructions on how to perform each of a chosen set of 12 hand gestures and provided visual feedback to the participant when the system was in the recording stage. All gestures were performed by holding down the s key with the non-dominant hand to record and then using the primary hand to execute the gesture at a distance of 6" to 12" above the top face of the controller. The code for this capture program is available online².

² https://github.com/rcmccartney/DataCollector

Students and staff on the RIT campus used the GUI to record their versions of each of the 12 gesture types: one finger tap, two finger tap, swipe, wipe, grab, release, pinch, check mark, figure 8, lower case e, capital E, and capital F. The one and two finger taps were vertical downward movements, performed as if tapping a single key or a set of keys on an imaginary keyboard. The swipe was a single left to right movement with the palm open and facing downwards, while the wipe was the same movement performed back and forth several times. The grab motion went from a palm-open to a closed-fist position, while the release was performed in the opposite direction. The pinch was performed with the thumb and forefinger going from open and separated to touching. The check mark was performed by pointing just the index finger straight out parallel to the Z axis, then moving the hand in a check motion while traveling primarily in the X-Y plane. The figure 8, lower case e, capital E, and capital F were all similarly performed by the index finger alone, tracing the visual pattern indicated by their name in the plane directly above the Leap Motion controller. The native Leap Motion Visualizer shown in Figure 1 was available for each subject to use alongside our collection GUI while performing the gestures if so desired, providing detailed visual feedback of the user's hand during motion.

As each gesture was performed, the Leap Motion API was queried for detailed data that was then appended to the current gesture file. The data was captured at over 100 frames per second and included information for the hand such as palm width, position, palm normal, pitch, roll, and yaw. Positions for the arm and wrist were also captured. For each finger, 15 different features were collected, such as position, length, width, and direction. In all, we collected 116 features for each frame of the recording, with the typical gesture lasting around 100 to 200 frames, although this average varies greatly by gesture class. Files for each gesture are arranged in top-level folders by gesture type, inside which each participant in the study has an anonymous numbered folder that contains all of their gesture instances for that class (a small indexing sketch follows at the end of this section). Typically, each user contributed 5 to 10 separate files per gesture class to the dataset, depending on the number of iterations each participant performed.

In all, approximately 9,600 gesture instances were collected from over 100 members of the RIT campus, with the full dataset totaling around 1.5 GB. The data is hosted online for public download³. Individual characteristics of each gesture vary widely, such as stroke lengths, angles, sizes, and positions within the controller's field of view. Some users had used the Leap Motion before or were comfortable performing gestures quickly after starting, while others struggled with the basic coordination required to execute the hand movements. Thus, there is considerable variation within a gesture class, and identifying a particular gesture from the features captured by the Leap Motion device is not a trivial pattern recognition task.

³ http://spiegel.cs.rit.edu/~hpb/LeapMotion/
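Given the folder layout described above (a top-level folder per gesture class, containing an anonymous numbered folder per participant, which in turn holds that participant's recordings), the dataset can be indexed as sketched below. The root path is hypothetical, and the assumption that every regular file inside a participant folder is one gesture instance is ours, since the per-file format is not detailed here.

import os
from collections import defaultdict

def index_dataset(root):
    """Map gesture class -> participant id -> list of gesture-instance files,
    following the class/participant/instance directory layout of the dataset."""
    index = defaultdict(dict)
    for gesture in sorted(os.listdir(root)):
        gesture_dir = os.path.join(root, gesture)
        if not os.path.isdir(gesture_dir):
            continue
        for participant in sorted(os.listdir(gesture_dir)):
            participant_dir = os.path.join(gesture_dir, participant)
            if not os.path.isdir(participant_dir):
                continue
            index[gesture][participant] = [
                os.path.join(participant_dir, f)
                for f in sorted(os.listdir(participant_dir))
                if os.path.isfile(os.path.join(participant_dir, f))
            ]
    return index

# Example (hypothetical local path): count instances per gesture class.
# counts = {g: sum(len(v) for v in p.values())
#           for g, p in index_dataset("LeapMotionData").items()}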
6. Image Creation

In its raw form, the varying temporal length of each gesture and the large number of features make it difficult to apply traditional machine learning techniques to this dataset. Thus, a form of dimensionality reduction and normalization is needed for any learning technique to be applied effectively. For the convolutional neural network (CNN) that we employ in Section 7, this dimensionality reduction took the form of converting each instance of real-valued, variable-length readings into a fixed-size image representation of the gesture.

Fig. 2: One instance example of each of the gestures used for the CNN experiment

Fig. 3: The mean image of the dataset on the left and the standard deviation on the right used for normalization

CNNs traditionally operate on image data, using alternating feature maps and pooling layers to capture equivariant activations in different locations of the input image. Due to the complex variations that are nevertheless recognizable to a human observer as a properly performed gesture, CNNs offer a way to allow for differences in translation, scaling, and skew in the path taken by an individual's unique version of the gesture. To transform each gesture into constant-sized input for the convolutional network, we created motion images on a black canvas using just the 3-dimensional position data of the index finger over the lifetime of the gesture. That is, for each frame we took the positions reported in the Leap coordinate axes, which vary approximately from -200 to 200 in X and Z and from 0 to 400 in Y, and transformed those coordinates into pixel space varying from 0 to 200 in three different planes, XY, YZ, and XZ. For each reported position, the pixels in the 5x5 surrounding region centered on the position were activated in a binary fashion. From this
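The projection just described can be sketched as follows: index-finger positions are rescaled from the Leap coordinate ranges into a 200x200 pixel canvas and a 5x5 neighborhood is switched on around each projected point, producing one binary image per plane (XY, YZ, XZ). This is our reading of the construction above; details such as which axis maps to rows versus columns and the clipping at the canvas border are assumptions.

import numpy as np

SIZE = 200  # pixels per side of each motion-image plane

def to_pixels(positions):
    """Rescale raw Leap coordinates (X and Z roughly in [-200, 200],
    Y in [0, 400]) into the [0, SIZE) pixel range."""
    pos = np.asarray(positions, dtype=float)
    scaled = np.empty_like(pos)
    scaled[:, 0] = (pos[:, 0] + 200.0) * SIZE / 400.0   # X
    scaled[:, 1] = pos[:, 1] * SIZE / 400.0             # Y
    scaled[:, 2] = (pos[:, 2] + 200.0) * SIZE / 400.0   # Z
    return np.clip(scaled, 0, SIZE - 1).astype(int)

def motion_images(index_finger_positions):
    """Build binary motion images for the XY, YZ, and XZ planes by
    activating the 5x5 region around each projected finger position."""
    px = to_pixels(index_finger_positions)
    planes = {"XY": (0, 1), "YZ": (1, 2), "XZ": (0, 2)}
    images = {name: np.zeros((SIZE, SIZE), dtype=np.uint8) for name in planes}
    for name, (a, b) in planes.items():
        for u, v in px[:, [a, b]]:
            lo_u, hi_u = max(u - 2, 0), min(u + 3, SIZE)
            lo_v, hi_v = max(v - 2, 0), min(v + 3, SIZE)
            images[name][lo_u:hi_u, lo_v:hi_v] = 1  # binary activation
    return images

Downstream, such images could be normalized with the dataset mean and standard deviation, as Figure 3 suggests, before being fed to the network.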
but the model can be extended to give equal consideration to all 3 planes of 2D projections, allowing for a wide variety of gesture representations. Despite its good performance, one of the limitations of this model is that it cannot provide online recognition of gestures in real time. As future work, we look to incorporate an alternative model, such as a hidden Markov model, as a segmentation method to determine likely start and stop points for each gesture, and then input the identified frames of data into the CNN model for gesture classification.
References

[1] W. Joy and M. Horton, "An introduction to display editing with vi," 1977. [Online]. Available: http://www.ele.uri.edu/faculty/vetter/Other-stuff/vi/vi-intro.pdf
[2] A. S.-K. Pang, "The making of the mouse," American Heritage of Invention and Technology, vol. 17, no. 3, pp. 48-54, 2002.
[3] R. Loyola, "Apple's Magic Mouse offers multitouch features," p. 65, Jan. 2010. [Online]. Available: http://search.proquest.com.ezproxy.rit.edu/docview/231461266?accountid=108
[4] F. Weichert, D. Bachmann, B. Rudak, and D. Fisseler, "Analysis of the accuracy and robustness of the Leap Motion controller," Sensors, vol. 13, no. 5, pp. 6380-6393, 2013. [Online]. Available: http://www.mdpi.com/1424-8220/13/5/6380
[5] L. Rabiner and B.-H. Juang, "An introduction to hidden Markov models," ASSP Magazine, IEEE, vol. 3, no. 1, pp. 4-16, 1986.
[6] M. Elmezain, A. Al-Hamadi, J. Appenrodt, and B. Michaelis, "A hidden Markov model-based continuous gesture recognition system for hand motion trajectory," in Pattern Recognition, 2008. ICPR 2008. 19th International Conference on, Dec. 2008, pp. 1-4.
[7] T. Schlömer, B. Poppinga, N. Henze, and S. Boll, "Gesture recognition with a Wii controller," in Proceedings of the 2nd International Conference on Tangible and Embedded Interaction. ACM, 2008, pp. 11-14.
[8] H.-K. Lee and J.-H. Kim, "An HMM-based threshold model approach for gesture recognition," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 21, no. 10, pp. 961-973, 1999.
[9] S. Eickeler, A. Kosmala, and G. Rigoll, "Hidden Markov model based continuous online gesture recognition," in Pattern Recognition, 1998. Proceedings. Fourteenth International Conference on, vol. 2. IEEE, 1998, pp. 1206-1208.
[10] P. Hong, M. Turk, and T. S. Huang, "Gesture modeling and recognition using finite state machines," in Automatic Face and Gesture Recognition, 2000. Proceedings. Fourth IEEE International Conference on. IEEE, 2000, pp. 410-415.
[11] R. Verma and A. Dev, "Vision based hand gesture recognition using finite state machines and fuzzy logic," in Ultra Modern Telecommunications & Workshops, 2009. ICUMT '09. International Conference on. IEEE, 2009, pp. 1-6.
[12] H. Hasan and S. Abdul-Kareem, "Static hand gesture recognition using neural networks," Artificial Intelligence Review, vol. 41, no. 2, pp. 147-181, 2014.
[13] K. Murakami and H. Taguchi, "Gesture recognition using recurrent neural networks," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 1991, pp. 237-242.
[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[15] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, "A novel connectionist system for unconstrained handwriting recognition," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 5, pp. 855-868, 2009.
[16] A. Graves and J. Schmidhuber, "Offline handwriting recognition with multidimensional recurrent neural networks," in Advances in Neural Information Processing Systems, 2009, pp. 545-552.
[17] A. Graves, "Offline Arabic handwriting recognition with multidimensional recurrent neural networks," in Guide to OCR for Arabic Scripts. Springer, 2012, pp. 297-313.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097-1105. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[19] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[20] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, "Return of the devil in the details: Delving deep into convolutional nets," arXiv preprint arXiv:1405.3531, 2014.
[21] A. Vedaldi and K. Lenc, "MatConvNet: Convolutional neural networks for MATLAB," CoRR, vol. abs/1412.4564, 2014.
[22] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint arXiv:1207.0580, 2012.
[23] G. E. Dahl, T. N. Sainath, and G. E. Hinton, "Improving deep neural networks for LVCSR using rectified linear units and dropout," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 8609-8613.
[24] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929-1958, Jan. 2014. [Online]. Available: http://dl.acm.org/citation.cfm?id=2627435.2670313
[25] D. Eberly, "Least squares fitting of data," Geometric Tools, LLC, 2015.