COMPUTER GAME INNOVATIONS, 2017
Gesture recognition in virtual reality environments
Roman Chomik, Jarosław Andrzejczak
Abstract—In recent years there has been great progress in virtual reality systems utilizing 6D motion controllers, but relatively little progress in user interfaces taking advantage of this technology. We present a motion gesture recognizer based on an artificial neural network. By utilizing both convolutional and recurrent layers we have developed a system which needs relatively few training samples to label a wide variety of classes. With a reduced input feature count, our solution provides a success rate comparable to a traditional hidden Markov model recognizer. This allows our system to be used with motion controllers based on inertial measurement units.
Index Terms—gesture recognition, recurrent neural networks, virtual reality
1 INTRODUCTION
Gestures are a form of non-verbal communication in which intent is expressed via motion of the hands, face or other body parts. Modern virtual reality systems often provide tracking of hand and head movement. This data can be used to infer a person's intent. Different devices provide data of varying precision and scope. Systems based on inertial measurement units (IMU) provide only acceleration and angular velocity, while more advanced controllers, such as the HTC Vive and Oculus Rift, provide absolute positional and rotational information.
A gesture is described by the motion of a body part. Classifying a gesture requires comparing it to previously seen samples. For the purposes of gesture recognition we define a gesture as a series of data points. In general, the speed at which a gesture is made does not affect its meaning. This means that gestures of the same class can vary in length. The features contained in a single data point heavily depend on the specifics of the input device.
In this paper we propose a motion gesture recognizer based on an artificial neural network, suitable for rapid development of gesture-based user interfaces utilizing modern motion controllers as well as legacy hardware. The three contributions to gesture recognition in virtual reality environments research presented in this paper are:
• An artificial neural network with high recognition rates across varying numbers of gesture classes and input data formats.
• A recognizer that provides a high success rate with a reduced input feature count compared to a traditional hidden Markov model recognizer [4].
• An original database for testing gesture recognition in virtual reality containing over two thousand elements (30 gestures repeated 12 times each by 6 participants). Besides the twenty gestures identical to those used in [2], ten new, more complex gestures were added.
We start with a discussion of the research motivation and related work in the next section. This is followed by a description of the neural network architecture and the input data format. We then present the experimental setup details and test results. Finally, an analysis of the results and the conclusions are given.
2 RELATED WORK
There has been some research into motion gesture recognition. Solutions developed so far are based on linear classifiers, AdaBoost and hidden Markov models [10], [2], [3], [4]. Recognition rates vary depending on the type of classifier, the number of training samples and the testing scenario (user-dependent, user-independent¹), but in general are claimed to be above 90%.
1. In the user-dependent scenario both training and testing are performed using gesture samples obtained from the same tester. In the user-independent scenario training and testing data sets are created using samples from different users or disjoint user groups.
Linear classifiers work on the principle of manually specifying features based on the interpretation of input data. They are generally based on research by Rubine [1]. The resulting feature vectors are then compared to reference samples using a linear classifier. This method requires some manual work to define features which give good results. Using 5 gesture samples, a linear classifier provides recognition rates of up to 96.3% and 95.6% in user-dependent and user-independent scenarios respectively [10].
Hidden Markov model based classifiers improve upon linear classifiers. They do not require prior data processing nor manual specification of features. Without data normalization, results vary greatly. In order to give the best results, data should be normalized prior to feeding it into the classifier, which requires interpretation and analysis of the data obtained from the input device. Hidden Markov model based classifiers provide good recognition rates of up to 99.8% and 97.0% in user-dependent and user-independent scenarios respectively [4].
3 NEURAL NETWORK ARCHITECTURE
Our recognizer is based on an artificial neural network. Neural networks have been shown to succeed in tasks where traditional methods have performed poorly or required large workloads, such as speech recognition, language modelling or image analysis. Their main advantage is the ability to learn the features which best describe the nature of the problem. In the case of gesture recognition, it allows gestures to be specified by example rather than by manually specifying features. This in turn makes it possible for the users themselves to add new gestures to the system without exact knowledge of how the system works.
Our neural network has been designed to work well with a variety of input data formats in their raw form. The data is assumed to be neither normalized nor processed in any other way. One of the neural network's tasks is the extraction of meaningful features from various sensor readings and data representations. It should be able to handle both inertial and absolute readings, and provide acceptable recognition rates with limited input data. To fulfil this task the neural network consists of both convolutional and recurrent layers.
The neural network architecture is layered without skipping, i.e. each layer connects only to the next. It consists of a convolutional layer followed by a number of recurrent layers and a dense projection layer. Each recurrent layer consists of Long Short-Term Memory (LSTM) cells [8]. The projection layer accepts the last output of the recurrent layer and reduces the number of outputs to the number of gesture classes. Its outputs can be directly interpreted as probabilities of class occurrence in the input data. The neural network accepts inputs of varying length and consists of the following operations, in order:
1) Convolution
2) Bias
3) Rectifier activation (ReLU)
4) Dropout
5) LSTM recurrent
6) Dropout
7) LSTM recurrent
8) Dense
9) Softmax activation
Operations 1-3 are referred to as the "Convolutional layer". Operations 4-7 are referred to as the "Recurrent layer". Operations 8-9 are referred to as the "Projection layer" (Fig. 1).

Fig. 1. The neural network operations order scheme showing the Convolutional, Recurrent and Projection layers.
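As an illustration only, the following sketch shows how the described stack of operations could be assembled with the tf.keras API of the TensorFlow library cited in Section 4. The layer sizes (128 convolution channels, filter width 4, two sets of 128 LSTM cells) follow the text; the function name, input handling and all remaining details are our assumptions, not the original implementation.

import tensorflow as tf

def build_gesture_model(num_features, num_classes):
    # Variable-length gesture sequences; each time step carries num_features components.
    inputs = tf.keras.Input(shape=(None, num_features))
    # Convolutional layer: convolution + bias + ReLU (operations 1-3).
    x = tf.keras.layers.Conv1D(filters=128, kernel_size=4, strides=1,
                               padding="valid", activation="relu")(inputs)
    # Recurrent layer: dropout + LSTM, twice (operations 4-7). Keras reuses the input
    # dropout mask across time steps; the keep probability of 25% stated in Section 3.3
    # corresponds to a drop rate of 0.75 in Keras terms.
    x = tf.keras.layers.LSTM(128, return_sequences=True, dropout=0.75)(x)
    x = tf.keras.layers.LSTM(128, dropout=0.75)(x)  # keeps only the last output
    # Projection layer: dense + softmax (operations 8-9).
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)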
3.1 Feeding data
The input to our recognizer is represented by a series of vectors. To speed up training, multiple gestures are fed in batches. Each vector contains different components recorded at a specific time point, such as a position vector, rotation matrix, etc. The different component types are represented as vectors and appended to one another. Vectors in the series are sampled at a constant frequency. As gestures can be of different lengths, they are padded with zeros to the length of the longest example in the set. This potentially wastes a significant portion of memory, but simplifies training by allowing an arbitrary choice of examples in batches. Data in this form is fed directly into the neural network without any further preprocessing.
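For illustration, a minimal zero-padding routine of the kind described above might look as follows; the function name and array layout are assumptions, not the original code.

import numpy as np

def pad_batch(gestures):
    # gestures: list of arrays shaped (time_steps_i, num_features), possibly of different lengths.
    max_len = max(g.shape[0] for g in gestures)
    num_features = gestures[0].shape[1]
    batch = np.zeros((len(gestures), max_len, num_features), dtype=np.float32)
    for i, g in enumerate(gestures):
        batch[i, :g.shape[0], :] = g  # trailing time steps remain zero padding
    return batch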
3.2 Convolutional layer
Convolutional layers have been successfully used in image classification problems [6], [7]. Their structure allows for the extraction of local features that can occur at any position in the input image. Similarly, we have used a convolutional layer to learn local temporal features, i.e. features that occur within a short time span. Examples of such features include, but are not limited to, a specific change in velocity, position or orientation. Convolution can also potentially be used to detect correlations between different components of the input data. Image data is usually two-dimensional in nature (width and height), with a number of color channels (usually 3). Our first dimension is time. In order to allow interactions between different components of the input vector, we treat the components as channels rather than as a second dimension. Thus our problem is one-dimensional (time) with the number of channels equal to the number of input components.
The first layer accepts the input data. It is treated as an array and convolution is performed. The convolution filter has a width of 4 and a stride of 1. No data padding at the edges is applied, as it is difficult to perform any meaningful padding without knowing the exact data format and its interpretation, which is precisely the kind of dependency we are trying to avoid. The output of the convolutional layer is shorter in length than its input, but contains many more channels. These channels represent learned short-term temporal features at given time points in the input stream. The next operation is adding a learned bias to each of the features. Then the ReLU activation function is applied. The convolutional layer outputs 128 channels.
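As a quick sanity check of the shapes involved (the concrete batch size, sequence length and component count below are illustrative, not taken from the experiments): a filter of width 4 with stride 1 and no padding shortens a sequence of length T to T - 3 while expanding it to 128 channels.

import tensorflow as tf

# Illustrative example: 2 gestures, 60 time steps, 7 input components treated as channels.
x = tf.random.normal((2, 60, 7))
conv = tf.keras.layers.Conv1D(filters=128, kernel_size=4, strides=1, padding="valid")
print(conv(x).shape)  # (2, 57, 128): length shrinks by filter width minus one, 128 feature channels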
3.3 Recurrent layer
Because gestures can be of different lengths and can be executed at different speeds, we need a method that can adapt to the varying occurrence of features in the input data. Recurrent neural networks have been used to recognize temporal patterns and as such have found use in speech and handwriting recognition, automated translation and language modeling. A recurrent neural network is created by feeding a layer's output back as its input in the next time step. The recurrent structure provides such networks with a limited form of memory. The neural network has direct access to its current input and indirect access to features that were available in previous time steps. Because of the very large depth of such networks, training them via gradient descent methods has been infeasible due to the problem of vanishing/exploding gradients. One solution to this problem is Long Short-Term Memory cells [8]. They work by either permitting the entire signal to pass through to the next time step or blocking it entirely. This results in gradients being passed exactly as they are or blocked.
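For reference, a commonly used LSTM formulation (the variant with a forget gate, which extends the original cell of [8]) computes, for input $x_t$ and previous state $(h_{t-1}, c_{t-1})$:

\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), \quad
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \quad
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o),\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c), \quad
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad
h_t = o_t \odot \tanh(c_t).
\end{aligned}

The gates $i_t$, $f_t$ and $o_t$ saturate near 0 or 1, which is what allows the cell state $c_t$, and hence the gradient, to be passed on almost unchanged or blocked.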
A gesture can be thought of as a sequence of basic movements performed by a user, such as moving the hand in a specific direction or twisting the wrist. In each gesture such a sequence can be found which is common to all users regardless of the exact way the gesture is performed, the speed of execution or the hand wielding the controller. Convolution extracts those basic movements, whereas the recurrent layer learns how those features contribute to higher-level features. By stacking several recurrent layers the neural network is capable of learning entire gestures.
The recurrent layer consists of two sets of LSTM cells. Each of these sets has dropout applied to its input to prevent overfitting and improve generalization. The dropout mask is preserved across all time steps, as described in [9]. The dropout keep probability is shared between both sets and has a value of 25% during training and 100% when used for inference. The number of LSTM cells is equal to 128 for each of the sets.
The projection layer is directly used for gesture classification. Based on the last valid output of the recurrent layer, the projection layer outputs the confidence levels of a gesture belonging to each class. The projection layer comprises a number of neurons equal to the number of output classes and thus varies depending on the data set used. Finally, the softmax function is applied. Due to the use of the softmax function, the sum of the outputs is always equal to 1 and individual outputs can be interpreted as probabilities.
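A minimal numeric illustration of this last step (the values are arbitrary and hypothetical):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the maximum for numerical stability
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])   # hypothetical projection-layer outputs for 3 gesture classes
probs = softmax(logits)
print(probs, probs.sum())             # per-class probabilities; the sum is always 1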
3.4 Training
Supervised learning is used to train the model. Training is performed using a gradient descent method. In order to provide faster convergence, the Adam optimizer is used with a learning rate of 0.001 [5]. Training is performed until a predefined training loss is achieved or the model no longer converges for a specified number of epochs. No separate validation data set is used. Values of 0.05 and 20, respectively, are used for the target training loss and the number of epochs without convergence. Training is performed on a GPU. The batch size is limited by the available memory and thus depends on the maximum gesture length and model complexity. In our case a batch has a size of 200 gestures. Every epoch the training set is shuffled and new batches are selected. This prevents overfitting to a part of the training data and improves generalization.
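A sketch of such a training loop using tf.keras is shown below, reusing the build_gesture_model sketch given earlier in this section. The learning rate, target loss, patience and batch size follow the text; the dummy data, loss choice and callback wiring are assumptions for illustration and do not reproduce the original TensorFlow code.

import tensorflow as tf

class StopAtTargetLoss(tf.keras.callbacks.Callback):
    # Stop once the training loss drops below the target value (0.05 in the text).
    def __init__(self, target=0.05):
        super().__init__()
        self.target = target
    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get("loss", float("inf")) < self.target:
            self.model.stop_training = True

# Dummy data just to make the sketch runnable; real gestures come from the databases in Section 4.
train_x = tf.random.normal((200, 60, 7))
train_y = tf.random.uniform((200,), maxval=30, dtype=tf.int32)

model = build_gesture_model(num_features=7, num_classes=30)  # assumed helper from the earlier sketch
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train_x, train_y,
          batch_size=200, shuffle=True,  # batches reshuffled every epoch
          epochs=1000,
          callbacks=[StopAtTargetLoss(0.05),
                     tf.keras.callbacks.EarlyStopping(monitor="loss", patience=20)])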
4 EXPERIMENTAL SETUP
The aim of the performed tests was to examine the gesture recognition success rate of the proposed artificial neural network and to identify the minimal number of features needed to obtain high success rates (90% and more). Additionally, we study the effect of the number of gesture classes and their complexity. The neural network model has been implemented² using the TensorFlow library [11].
2. Source code available at https://github.com/romanchom/GestureRecognitionVR
Fig. 2. Shapes of the additional gestures in the Vive database. Dots denote the beginning of each gesture.
Two databases of gesture samples were available for the tests. The first database (named Wii) was created by the authors of [2] and is publicly available. It contains samples generated by 28 testers aged 15-33. Each gesture was performed 10 times by each participant and belongs to one of 20 classes. The second database (named Vive) was created by our team with the assistance of 6 testers using HTC Vive motion controllers. Each participant was asked to perform 30 gestures 12 times each. The first 20 gesture classes are identical to the ones in [2]; the remaining ones are more complex shapes shown in Fig. 2. Due to the different data formats, a model trained on one database cannot be tested against the other and vice versa.
We have created two experiment scenarios designed to evaluate different aspects of the neural network. The first experiment is designed to directly compare our proposed solution to the existing HMM solution. The second experiment studies the performance of the neural network gesture recognizer with more complex shapes.
The first experimental setup is based on the one described in [2], [3], [4]. This allows a direct comparison of different gesture recognition solutions.
TABLE 1
User-dependent recognition rates in percent of the proposed neural network for single features for both databases, with comparison to HMM results without (HMM w/o) and with normalization (HMM n). P - position, O - orientation, V - velocity, W - angular velocity, A - acceleration.

Feature   Wii [%]   Vive [%]   HMM w/o [%]   HMM n [%]
P           98.8      89.3        97.6          97.8
O           98.8      97.0        98.5          98.8
V           98.3      96.1        98.2          98.5
W           98.8      95.8        97.7          98.1
A           98.0        —         98.5          98.4
The first scenario is user-dependent recognition. The network is trained on half of the samples of each gesture class of a single tester. Then it is tested against the remaining samples of the same tester. The experiment is repeated for each of the testers (a code sketch of this per-tester split is given after the feature lists below). The second scenario is user-independent recognition. The network is trained on all gesture samples of 5 selected right-handed testers. Then it is tested against all other samples in the given database. The experiment is repeated for 100 random combinations of 5-tester groups and the results are averaged. This scenario is designed to show how well the recognizer performs with limited input samples and generalizes to larger data sets. This experiment has been performed for both databases, but only on the gestures common to both. The experiment has been performed with each separate feature and with feature sets depending on the database used. For the Wii database they are exactly the same as in [4] and are as follows:
• PV - position and velocity
• AW - acceleration and angular velocity
• AWO - acceleration, angular velocity and orientation
• PVO - position, velocity and orientation
• PVOW - position, velocity, orientation and angular velocity
• PVOWA - position, velocity, orientation, angular velocity and acceleration
For the Vive database they are as follows:
• PV - absolute position and velocity
• PVO - absolute position, velocity and orientation
• PVOW - position, velocity, orientation and angular velocity
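The per-tester split mentioned above could be implemented along the following lines; this is only a sketch under an assumed data layout (a mapping from class label to a list of gesture arrays for one tester), not the exact procedure used in our experiments.

import random

def user_dependent_split(samples_by_class, seed=0):
    # samples_by_class: dict mapping class label -> list of gesture arrays for one tester (assumed format).
    rng = random.Random(seed)
    train, test = [], []
    for label, gestures in samples_by_class.items():
        gestures = list(gestures)
        rng.shuffle(gestures)
        half = len(gestures) // 2
        train += [(g, label) for g in gestures[:half]]   # half of each class for training
        test += [(g, label) for g in gestures[half:]]    # the rest for testing
    return train, test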
The second experiment is designed to evaluate the performance of our recognizer with increasing complexity and number of gesture classes. It is performed only on the Vive database, in three stages. First, only the 20 gestures common to both databases are used. Then only the new gestures are considered. Finally, all gestures are used in the experiment. We can thus evaluate how our solution handles more complex gestures and how increasing the number of gesture classes impacts the network's performance. As before, both user-dependent and user-independent scenarios are considered. Only the full feature set (PVOW - position, velocity, orientation and angular velocity) is used in this experiment.
5 RESULTS
5.1 Comparison to HMM
Table 1 contains the results of the first experiment comparing the proposed neural network to the HMM solution when using only single features in the user-dependent scenario.
TABLE 2
User-independent recognition rates in percent of the proposed neural network for single features for both databases, with comparison to HMM results without (HMM w/o) and with normalization (HMM n). P - position, O - orientation, V - velocity, W - angular velocity, A - acceleration.

Feature   Wii [%]   Vive [%]   HMM w/o [%]   HMM n [%]
P           86.2      81.9        88.7          88.6
O           86.6      80.3        72.6          88.9
V           89.9      93.5        82.0          91.3
W           84.4      91.9        69.8          80.8
A           80.3        —         71.5          88.6
TABLE 3
User-dependent recognition rates in percent with various feature sets for both databases. PV - position and velocity, AW - acceleration and angular velocity, AWO - acceleration, angular velocity and orientation, PVO - position, velocity and orientation, PVOW - position, velocity, orientation and angular velocity, PVOWA - position, velocity, orientation, angular velocity and acceleration.

Feature set   Wii [%]   Vive [%]
PV              98.9      97.4
AW              98.6        —
AWO             98.7        —
PVO             99.4      97.1
PVOW            98.8      97.1
PVOWA           98.8        —
Our recognizer performs on par with the HMM solution both with and without normalization. The recognition rates are above 98% in the case of the Wii database. It is suspected that in the case of user-dependent recognition the intra-class variance is low enough that normalization has very little effect and both systems can perform equally well. The recognition rates obtained using the Vive database are somewhat lower and the cause is probably related to the tracking technology.
Table 2 contains the analogous results for the user-independent test case. It can be clearly seen that our solution can correctly label a high percentage of samples. The recognition rates are usually higher than those of the HMM using non-normalized data. Our solution performs on par with the HMM using normalized data, and in one case it even performs better. Despite having no knowledge of the underlying data format, the neural network is capable of generalizing the gestures with regard to the variations introduced by each person's unique way of executing gestures, thus rendering normalization unnecessary.
Tables 3 and 4 contain the results of the first experiment comparing the proposed neural network to the HMM system using combined features. In the user-dependent scenario the neural network performs well, with recognition rates above 98%. It can be seen that the choice of feature set has a marginal effect on the success rate and in practice any combination of features could be used. The differences between the two recognizers are within 1% and there is no clear winner between our solution and the HMM system using normalized data. In some cases our solution performs better, in others the HMM is superior. In the user-independent scenario the results are in favour of the HMM system.
TABLE 4
User-independent recognition rates in percent with various feature sets for both databases. PV - position and velocity, AW - acceleration and angular velocity, AWO - acceleration, angular velocity and orientation, PVO - position, velocity and orientation, PVOW - position, velocity, orientation and angular velocity, PVOWA - position, velocity, orientation, angular velocity and acceleration.

Feature set   Wii [%]   Vive [%]
PV              91.9      97.1
AW              88.3        —
AWO             88.9        —
PVO             92.8      94.4
PVOW            90.0      97.7
PVOWA           91.4        —
TABLE 5
Recognition rates in percent for simple, complex and combined gesture classes in user-dependent (UD) and user-independent (UI) classification.

Gesture set   UD [%]   UI [%]
Simple          97.2     92.5
Complex         98.3     97.6
Combined        96.4     94.9
5.2 Effect of gesture complexity and number of distinct classes
The availability of a gesture database containing both simple and complex gestures allowed us to compare the neural network's ability to recognize each. It also lets us study how the number of gesture classes affects the success rate. The results are contained in Table 5. It can be seen that our solution can successfully learn and recognize complex gestures. In fact, the recognition rates are higher in the case of complex gestures than in the case of simple ones. It can be presumed that complex gestures contain more information, which increases inter-class variance, thus making them easier to classify correctly. The results using the combined gesture classes show that the number of gesture classes has little effect on the recognition rates as long as the inter-class variance is sufficiently large. This means that gestures should differ significantly, so that it is difficult for a user to perform an ambiguous gesture.
6 CONCLUSION
We show that our solution successfully overcomes several challenges. The neural network is capable of learning and extracting features from different feature sets, including both absolute and inertial tracking data. With as little as a single feature it is capable of an over 98% recognition rate in user-dependent recognition. Moreover, the data requires no preprocessing and can be fed as is. This means that no additional work is required to decode and normalize data from motion controllers. This makes our recognizer suitable for creating personalized user interfaces based on motion gestures. Any device capable of reading the motion of a user's hand can potentially be used as an input device.
In user-independent scenarios our solution performs with around 90% accuracy regardless of the device used and the feature set available when trained with data from 5 users. Our solution can be applied to different types of motion gestures without modifying the neural network architecture. It can handle both simple and complex gestures at the same time. Our main contribution is a neural network architecture for VR gesture recognition that is extremely flexible. It provides high recognition rates with varying numbers of gesture classes, input data formats and feature counts. This makes it suitable for rapid development of gesture-based user interfaces utilizing modern motion controllers as well as legacy hardware.
REFERENCES
[1] D. Rubine, Specifying gestures by example, SIGGRAPH Comput. Graph., vol. 25, pp. 329-337, Jul. 1991.
[2] Mingyu Chen, Ghassan AlRegib, Biing-Hwang Juang: A new 6D motion gesture database and the benchmark results of feature-based statistical recognition, Emerging Signal Processing Applications (ESPA), 2012 IEEE International Conference on, pp. 131-134, 2012.
[3] Mingyu Chen, Ghassan AlRegib, Biing-Hwang Juang: 6D motion gesture recognition using spatio-temporal features, Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pp. 2341-2344, 2012, ISSN 1520-6149.
[4] Mingyu Chen, Ghassan AlRegib, Biing-Hwang Juang: Feature Processing and Modeling for 6D Motion Gesture Recognition, Multimedia, IEEE Transactions on, vol. 15, pp. 561-571, 2013, ISSN 1520-9210.
[5] Diederik P. Kingma, Jimmy Ba: Adam: A Method for Stochastic Optimization, 3rd International Conference for Learning Representations, San Diego, 2015.
[6] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner: Gradient-based learning applied to document recognition, Proc. of the IEEE, November 1998.
[7] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton: ImageNet Classification with Deep Convolutional Neural Networks, Advances in Neural Information Processing Systems 25 (NIPS 2012).
[8] Sepp Hochreiter, Jürgen Schmidhuber: Long Short-Term Memory, Neural Computation, vol. 9, issue 8, November 15, 1997, pp. 1735-1780.
[9] Yarin Gal, Zoubin Ghahramani: A Theoretically Grounded Application of Dropout in Recurrent Neural Networks, 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
[10] Michael Hoffman, Paul Varcholik, Joseph J. LaViola Jr.: Breaking the Status Quo: Improving 3D Gesture Recognition with Spatially Convenient Input Devices, Virtual Reality Conference (VR), 2010 IEEE.
[11] TensorFlow. https://www.tensorflow.org/, January 20, 2017.
Roman Chomik is an undergraduate at the Faculty of Technical Physics, Information Technology and Applied Mathematics, Lodz University of Technology, specializing in Game and Computer Simulations Technology. He is currently continuing graduate studies at Lodz University of Technology with the speciality of Game Technology and Interactive Systems.
Jarosław Andrzejczak is an Assistant Professor at the Institute of Information Technology, Faculty of Technical Physics, Information Technology and Applied Mathematics, Lodz University of Technology. In 2015 he received a PhD degree in computer science for interactive information visualization of digital data set search results. His research interests encompass usability testing and engineering, User Experience, data and information visualization, user interface design (including game interface design) as well as the application of information visualization in UI design.