Gesture recognition in virtual reality environments

2017, Computer Game Innovations

Roman Chomik, Jarosław Andrzejczak

Abstract: In recent years there has been great progress in virtual reality systems utilizing 6D motion controllers, but relatively little progress in user interfaces that take advantage of this technology. We present a motion gesture recognizer based on an artificial neural network. By combining convolutional and recurrent layers we have developed a system which needs relatively few training samples to label a wide variety of classes. With a reduced input feature count our solution provides a success rate comparable to that of a traditional hidden Markov model recognizer. This allows our system to be used with motion controllers based on inertial measurement units.

Index Terms: gesture recognition, recurrent neural networks, virtual reality

1 INTRODUCTION

Gestures are a form of non-verbal communication in which intent is expressed via motion of the hands, face or other body parts. Modern virtual reality systems often track hand and head movement, and this data can be used to infer a person's intent. Different devices provide varying precision and amounts of data. Systems based on inertial measurement units (IMU) provide only acceleration and angular velocity, while more advanced controllers, such as those of the HTC Vive and Oculus Rift, provide absolute positional and rotational information.

A gesture is described by the motion of a body part. Classifying a gesture requires comparing it to previously seen samples. For the purposes of gesture recognition we define a gesture as a series of data points. In general the speed at which a gesture is made does not affect its meaning, so gestures of the same class can vary in length. The features contained in a single data point depend heavily on the specifics of the input device.
In this paper we propose a motion gesture recognizer based on an artificial neural network, suitable for rapid development of gesture-based user interfaces utilizing modern motion controllers as well as legacy hardware. The three contributions to research on gesture recognition in virtual reality environments presented in this paper are:

• An artificial neural network with high recognition rates across varying numbers of gesture classes and input data formats.
• A recognizer that provides a high success rate with a reduced input feature count compared to a traditional hidden Markov model [4].
• An original database for testing gesture recognition in virtual reality containing over two thousand elements (30 gestures repeated 12 times each by 6 participants). Besides twenty gestures identical to those used in [2], ten new, more complex gestures were added.

We start with a discussion of research motivation and related work in the next section. This is followed by a description of the neural network architecture and input data format. We then present the experimental setup and test results. Finally, we analyze the results and give our conclusions.

2 RELATED WORK

There has been some research into motion gesture recognition. Solutions developed so far are based on linear classifiers, AdaBoost and hidden Markov models [10] [2] [3] [4]. Recognition rates vary depending on the type of classifier, the number of training samples and the testing scenario (user-dependent or user-independent¹), but in general are claimed to be above 90%.

Linear classifiers rely on manually specified features derived from an interpretation of the input data. They are generally based on research by Rubine [1]. The resulting feature vectors are compared to reference samples using a linear classifier. This method requires some manual work to define features which give good results.
Using 5 gesture samples, a linear classifier provides recognition rates of up to 96.3% and 95.6% in user-dependent and user-independent scenarios respectively [10].

Hidden Markov model based classifiers improve upon linear classifiers. They require neither prior data processing nor manual specification of features. Without data normalization, however, results vary greatly. To give the best results, data should be normalized before it is fed into the classifier, which requires interpreting and analyzing the data obtained from the input device. Hidden Markov model based classifiers provide good recognition rates of up to 99.8% and 97.0% in user-dependent and user-independent scenarios respectively [4].

¹ In the user-dependent scenario both training and testing are performed using gesture samples obtained from the same tester. In the user-independent scenario training and testing data sets are created using samples from different users or disjoint user groups.

3 NEURAL NETWORK ARCHITECTURE

Our recognizer is based on an artificial neural network. Neural networks have been shown to succeed in tasks where traditional methods have performed poorly or required large workloads, such as speech recognition, language modelling or image analysis. Their main advantage is the ability to learn the features which best describe the nature of the problem. In the case of gesture recognition this allows gestures to be specified by example rather than by manually defined features. This in turn makes it possible for the users themselves to add new gestures to the system without exact knowledge of how the system works.

Our neural network has been designed to work well with a variety of input data formats in their raw form. The data is assumed to be neither normalized nor processed in any other way. One of the neural network's tasks is the extraction of meaningful features from various sensor readings and data representations.
It should be able to handle both inertial and absolute readings, and provide acceptable recognition rates with limited input data. To fulfil this task the neural network consists of both convolutional and recurrent layers.

The neural network architecture is layered without skip connections, i.e. each layer connects only to the next. It consists of a convolutional layer followed by a number of recurrent layers and a dense projection layer. Each recurrent layer consists of Long Short-Term Memory (LSTM) cells [8]. The projection layer accepts the last output of the recurrent layer and reduces the number of outputs to the number of gesture classes. Its outputs can be directly interpreted as probabilities of class occurrence in the input data. The neural network accepts inputs of varying length.

The neural network consists of the following operations in order:

1) Convolution
2) Bias
3) Rectifier activation (ReLU)
4) Dropout
5) LSTM recurrent
6) Dropout
7) LSTM recurrent
8) Dense
9) Softmax activation

Operations 1-3 are referred to as "Convolutional layer". Operations 4-7 are referred to as "Recurrent layer". Operations 8-9 are referred to as "Projection layer" (Fig. 1).

3.1 Feeding data

The input to our recognizer is represented by a series of vectors. To speed up training, multiple gestures are fed in batches. Each vector contains different components recorded at a specific point in time, such as a position vector, a rotation matrix, etc. Components of different types are each represented as vectors and concatenated. Vectors in the series are sampled at a constant frequency. As gestures can be of different lengths, they are padded with zeros to the length of the longest example in the set. This potentially wastes a significant portion of memory, but simplifies training by allowing an arbitrary choice of examples in each batch. Data in this form is fed directly into the neural network without any further preprocessing.
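The zero-padding step described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: the names `pad_batch`, `Frame` and `Gesture` are hypothetical, and returning the original lengths alongside the padded batch is an assumption of this sketch (the text does not state how padded time steps are tracked).

```python
from typing import List, Tuple

Frame = List[float]      # one time step: all feature components concatenated
Gesture = List[Frame]    # frames sampled at a constant frequency


def pad_batch(gestures: List[Gesture]) -> Tuple[List[Gesture], List[int]]:
    """Zero-pad every gesture to the length of the longest one in the set.

    Assumes every gesture has at least one frame and that all frames have
    the same number of components. The original lengths are returned too,
    which is an assumption of this sketch rather than something the text
    specifies.
    """
    if not gestures:
        return [], []
    n_features = len(gestures[0][0])
    lengths = [len(g) for g in gestures]
    max_len = max(lengths)
    padded = [
        [list(frame) for frame in g]
        + [[0.0] * n_features for _ in range(max_len - len(g))]
        for g in gestures
    ]
    return padded, lengths


# Two toy gestures with 3 components per frame (e.g. a position vector).
short = [[0.1, 0.2, 0.3], [0.2, 0.3, 0.4]]
longer = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0], [4.0, 4.0, 4.0]]
batch, lengths = pad_batch([short, longer])
```

After the call every gesture in `batch` has four frames, and `lengths` still records that the first gesture originally had only two, which is the kind of information needed to pick the last valid output of a recurrent layer.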
3.2 Convolutional layer

Convolutional networks have been successfully used in image classification problems [6] [7]. Their structure allows for the extraction of local features that can occur at any position in the input image. Similarly, we use a convolutional layer to learn local temporal features, i.e. features that occur within a short time span. Examples of such features include, but are not limited to, a specific change in velocity, position or orientation. Convolution can also potentially be used to detect correlations between different components of the input data.

Image data is usually two-dimensional (width and height) with a number of color channels (usually 3). Our first dimension is time. In order to allow interactions between different components of the input vector we treat the components as channels rather than as a second dimension. Thus our problem is one-dimensional (time) with the number of channels equal to the number of input components.

The first layer accepts the input data, treats it as an array and performs the convolution. The convolution filter has a width of 4 and a stride of 1. No padding is applied at the edges, as it is difficult to perform any meaningful padding without knowing the exact data format and its interpretation, which is exactly the kind of dependency we are trying to avoid. The output of the convolutional layer is shorter than its input, but contains many more channels. These channels represent learned short-term temporal features at given time points in the input stream. The next operation adds a learned bias to each of the features. Then the ReLU activation function is applied. The convolutional layer outputs 128 channels.

3.3 Recurrent layer

Because gestures can be of different lengths and can be executed at different speeds, we need a method that can adapt to the varying occurrence of features in the input data.
Recurrent neural networks have been used to recognize temporal patterns and as such have found use in speech and handwriting recognition, automated translation and language modeling. A recurrent neural network is created by feeding a layer's output back as its input in the next time step. The recurrent structure provides the network with a limited form of memory: it has direct access to its current input and indirect access to features that were available in previous time steps. Because such networks are very deep when unrolled over time, training them with gradient descent methods was infeasible due to the vanishing/exploding gradient problem. One solution to this problem is Long Short-Term Memory cells [8]. In simplified terms, these work by either permitting the entire signal to pass through to the next time step or blocking it entirely. As a result, gradients are either passed on unchanged or blocked.

A gesture can be thought of as a sequence of basic movements performed by a user, such as moving the hand in a specific direction or twisting the wrist. In each gesture a sequence can be found which is common to all users regardless of the exact way the gesture is performed, the speed of execution or the hand wielding the controller. Convolution extracts those basic movements, whereas the recurrent layer learns how those features combine into higher-level features. By stacking several recurrent layers the neural network is capable of learning entire gestures.

Fig. 1. The order of neural network operations, showing the Convolutional, Recurrent and Projection layers.

The recurrent layer consists of two sets of LSTM cells. Each of these sets has dropout applied to its input to prevent overfitting and improve generalization. The dropout mask is preserved across all time steps as described in [9].
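The idea of preserving one dropout mask across all time steps can be illustrated with a short plain-Python sketch. It is not the authors' code: the function names are hypothetical, the keep probability is left as a parameter, and rescaling kept units by 1/keep_prob (so-called inverted dropout) is an assumption of this sketch.

```python
import random
from typing import List

Frame = List[float]


def sample_mask(n_features: int, keep_prob: float, rng: random.Random) -> List[float]:
    """Draw one mask per sequence; kept units are scaled by 1/keep_prob.

    The rescaling is an assumption of this sketch, not stated in the text.
    """
    return [(1.0 / keep_prob) if rng.random() < keep_prob else 0.0
            for _ in range(n_features)]


def apply_shared_dropout(sequence: List[Frame], keep_prob: float,
                         rng: random.Random) -> List[Frame]:
    """Apply the same dropout mask to every time step of one sequence."""
    if keep_prob >= 1.0:  # inference: nothing is dropped
        return [list(frame) for frame in sequence]
    mask = sample_mask(len(sequence[0]), keep_prob, rng)  # sampled once
    return [[x * m for x, m in zip(frame, mask)] for frame in sequence]


rng = random.Random(0)
sequence = [[1.0] * 8 for _ in range(5)]  # 5 time steps, 8 features each
dropped = apply_shared_dropout(sequence, keep_prob=0.5, rng=rng)
```

Because the mask is sampled once per sequence, the same input features are zeroed at every time step, in contrast to ordinary dropout, which would draw a fresh mask for each step.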
The dropout keep probability is shared between both sets and has a value of 25% during training and 100% during inference. Each of the sets contains 128 LSTM cells.

The projection layer is used directly for gesture classification. Based on the last valid output of the recurrent layer, the projection layer outputs confidence levels of the gesture belonging to each class. The projection layer comprises a number of neurons equal to the number of output classes, which therefore varies depending on the data set used. Finally the softmax function is applied. Due to the use of the softmax function the sum of outputs is always equal to 1 and individual outputs can be interpreted as probabilities.

3.4 Training

Supervised learning is used to train the model. Training is performed using the gradient descent method. In order to provide faster convergence the Adam optimizer is used with a learning rate of 0.001 [5]. Training is performed until a predefined training loss is achieved or the model no longer converges for a specified number of epochs. No separate validation data set is used. The target training loss is 0.05 and the number of epochs without convergence is 20. Training is performed on a GPU. The batch size is limited by available memory and thus depends on the maximum gesture length and model complexity. In our case a batch contains 200 gestures. Every epoch the training set is shuffled and new batches are selected. This prevents overfitting to a part of the training data and improves generalization.

4 EXPERIMENTAL SETUP

The aim of the tests was to examine the recognition success rate of the proposed artificial neural network and to identify the minimal number of features needed to obtain a high success rate (90% and more). Additionally we study the effect of the number of gesture classes and their complexity.

The neural network model has been implemented² using the TensorFlow library [11].

² Source code available at: GestureRecognitionVR

Fig. 2. Shapes of additional gestures in the Vive database. Dots denote the beginning of a gesture.

Two databases of gesture samples were available for tests. The first database (named Wii) was created by the authors of [2] and is publicly available. It contains samples generated by 28 testers aged 15-33. Each gesture was performed 10 times by each participant and belongs to one of 20 classes. The second database (named Vive) was created by our team with the assistance of 6 testers using HTC Vive motion controllers. Each participant was asked to perform 30 gestures 12 times each. The first 20 gesture classes are identical to the ones in [2]; the remaining ones are more complex shapes shown in Fig. 2. Due to the different data formats, a model trained on one database cannot be tested against the other.

We have created two experiment scenarios designed to evaluate different aspects of the neural network. The first experiment is designed to directly compare our proposed solution to the existing HMM model. The second experiment studies the performance of the neural network gesture recognizer with more complex shapes.

TABLE 1
User-dependent recognition rates in percent of the proposed neural network for single features for both databases and comparison to HMM results without (HMM w/o) and with normalization (HMM n). P - position, O - orientation, V - velocity, W - angular velocity, A - acceleration.

Feature   Wii [%]   Vive [%]   HMM w/o [%]   HMM n [%]
P         98.8      89.3       97.6          97.8
O         98.8      97.0       98.5          98.8
V         98.3      96.1       98.2          98.5
W         98.8      95.8       97.7          98.1
A         98.0      -          98.5          98.4

The first experimental setup is based on the one described in [2] [3] [4]. This allows direct comparison of different gesture recognition solutions. The first scenario is user-dependent recognition. The network is trained on half of the samples of each gesture class from a single tester. Then it is tested against the remaining samples of the same tester. The experiment is repeated for each of the testers.
The second scenario is user-independent recognition. The network is trained on all gesture samples of 5 selected right-handed testers. Then it is tested against all other samples in the given database. The experiment is repeated for 100 random combinations of 5-tester groups and the results are averaged. This scenario is designed to show how well the recognizer performs with limited input samples and how it generalizes to larger data sets. This experiment has been performed for both databases, but only on gestures common to both.

The experiment has been performed with each separate feature and with feature sets that depend on the database used. For the Wii database they are exactly the same as in [4] and are as follows:

• PV - position and velocity
• AW - acceleration and angular velocity
• AWO - acceleration, angular velocity and orientation
• PVO - position, velocity and orientation
• PVOW - position, velocity, orientation and angular velocity
• PVOWA - position, velocity, orientation, angular velocity and acceleration

For the Vive database they are as follows:

• PV - absolute position and velocity
• PVO - absolute position, velocity and rotation
• PVOW - position, velocity, orientation and angular velocity

The second experiment is designed to evaluate the performance of our recognizer with increasing complexity and number of gesture classes. It is performed only on the Vive database in three stages. First only the 20 gestures common to both databases are used. Then only the new gestures are considered. Finally all gestures are used. This lets us evaluate how our solution handles more complex gestures and how increasing the number of gesture classes impacts the network's performance. As in the first experiment, both user-dependent and user-independent scenarios are considered. Only the full feature set (PVOW - position, velocity, orientation and angular velocity) is used in this experiment.
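The two evaluation scenarios can be sketched as data splits over sample identifiers. This is a hedged illustration and not the authors' test harness: the `Sample` tuple and both function names are hypothetical, taking the first half of each class for training is an assumption (the text only says "half"), and the restriction to right-handed testers is left out for brevity.

```python
import random
from typing import Dict, List, Tuple

# (tester_id, gesture_class, repetition_index); the motion data is omitted.
Sample = Tuple[int, int, int]


def user_dependent_split(samples: List[Sample],
                         tester: int) -> Tuple[List[Sample], List[Sample]]:
    """Train on half of each class's samples from one tester, test on the rest."""
    by_class: Dict[int, List[Sample]] = {}
    for s in samples:
        if s[0] == tester:
            by_class.setdefault(s[1], []).append(s)
    train: List[Sample] = []
    test: List[Sample] = []
    for class_samples in by_class.values():
        half = len(class_samples) // 2  # assumption: first half trains
        train.extend(class_samples[:half])
        test.extend(class_samples[half:])
    return train, test


def user_independent_split(samples: List[Sample], testers: List[int],
                           n_train: int,
                           rng: random.Random) -> Tuple[List[Sample], List[Sample]]:
    """Train on all samples of n_train random testers, test on all others."""
    train_testers = set(rng.sample(testers, n_train))
    train = [s for s in samples if s[0] in train_testers]
    test = [s for s in samples if s[0] not in train_testers]
    return train, test


# Toy index shaped like the common part of the Vive database:
# 6 testers, 20 gesture classes, 12 repetitions each.
samples = [(t, c, r) for t in range(6) for c in range(20) for r in range(12)]
ud_train, ud_test = user_dependent_split(samples, tester=0)
ui_train, ui_test = user_independent_split(samples, list(range(6)), 5,
                                           random.Random(0))
```

In the user-independent protocol the second call would be repeated for 100 random tester groups and the resulting recognition rates averaged.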
5 RESULTS

5.1 Comparison to HMM

Table 1 contains the results of the first experiment, comparing the proposed neural network to the HMM solution when using only single features in the user-dependent scenario. Our recognizer performs on par with the HMM solution both with and without normalization. The recognition rates are at or above 98% in the case of the Wii database. We suspect that in user-dependent recognition the intra-class variance is low enough that normalization has very little effect and both systems can perform equally well. The recognition rates obtained using the Vive database are somewhat lower and the cause is probably related to the tracking technology.

TABLE 2
User-independent recognition rates in percent of the proposed neural network for single features for both databases and comparison to HMM results without (HMM w/o) and with normalization (HMM n). P - position, O - orientation, V - velocity, W - angular velocity, A - acceleration.

Feature   Wii [%]   Vive [%]   HMM w/o [%]   HMM n [%]
P         86.2      81.9       88.7          88.6
O         86.6      80.3       72.6          88.9
V         89.9      93.5       82.0          91.3
W         84.4      91.9       69.8          80.8
A         80.3      -          71.5          88.6

TABLE 3
User-dependent recognition rates in percent with various feature sets for both databases. PV - position and velocity, AW - acceleration and angular velocity, AWO - acceleration, angular velocity and orientation, PVO - position, velocity and orientation, PVOW - position, velocity, orientation and angular velocity, PVOWA - position, velocity, orientation, angular velocity and acceleration.

Feature set   Wii [%]   Vive [%]
PV            98.9      97.4
AW            98.6      -
AWO           98.7      -
PVO           99.4      97.1
PVOW          98.8      97.1
PVOWA         98.8      -

Table 2 contains analogous results for the user-independent test case. It can be clearly seen that our solution can correctly label a high percentage of samples. The recognition rates are usually higher than those of the HMM using non-normalized data. Our solution performs on par with the HMM using normalized data.
In one case (angular velocity on the Wii database) it performs even better than the HMM with normalized data. Despite having no knowledge of the underlying data format, the neural network is capable of generalizing over the variations introduced by each person's unique way of executing gestures, thus rendering normalization unnecessary.

Tables 3 and 4 contain the results of the first experiment, comparing the proposed neural network to the HMM system using combined features. In the user-dependent scenario the neural network performs well, with recognition rates above 98% on the Wii database and above 97% on the Vive database. The choice of feature set has only a marginal effect on the success rate, and in practice any combination of features could be used. The differences between the two recognizers are within 1% and there is no clear winner between our solution and the HMM system using normalized data. In some cases our solution performs better, in others the HMM is superior. In the user-independent scenario the results are in favour of the HMM system.

TABLE 4
User-independent recognition rates in percent with various feature sets for both databases. PV - position and velocity, AW - acceleration and angular velocity, AWO - acceleration, angular velocity and orientation, PVO - position, velocity and orientation, PVOW - position, velocity, orientation and angular velocity, PVOWA - position, velocity, orientation, angular velocity and acceleration.
Feature set   Wii [%]   Vive [%]
PV            91.9      97.1
AW            88.3      -
AWO           88.9      -
PVO           92.8      94.4
PVOW          90.0      97.7
PVOWA         91.4      -

TABLE 5
Recognition rates in percent for simple, complex and combined gesture classes in user-dependent (UD) and user-independent (UI) classification.

Gesture set   UD [%]   UI [%]
Simple        97.2     92.5
Complex       98.3     97.6
Combined      96.4     94.9

5.2 Effect of gesture complexity and number of distinct classes

The availability of a gesture database containing both simple and complex gestures allowed us to compare the neural network's ability to recognize each. It also lets us study how the number of gesture classes affects the success rate. The results are given in Table 5. Our solution can successfully learn and recognize complex gestures. In fact the recognition rates are higher for complex gestures than for simple ones. Presumably complex gestures contain more information, which increases inter-class variance and makes them easier to classify correctly. The results using combined gesture classes show that the number of gesture classes has little effect on the recognition rates as long as the inter-class variance is sufficiently large. This means that gestures should differ significantly, so that it is difficult for a user to perform an ambiguous gesture.

6 CONCLUSION

We show that our solution successfully overcomes several challenges. The neural network is capable of learning and extracting features from different feature sets including both absolute and inertial tracking data. With as little as a single feature it achieves recognition rates of 98% or more in user-dependent recognition on the Wii database. Moreover, the data requires no preprocessing and can be fed as is. This means that no additional work is required to decode and normalize data from motion controllers. This makes our recognizer suitable for creating personalized user interfaces based on motion gestures. Any device capable of reading the motion of the user's hand can potentially be used as an input device. In user-independent scenarios our solution performs with around 90% accuracy regardless of the device used and the feature set available when trained with data from 5 users. Our solution can be applied to different types of motion gestures without modifying the neural network architecture. It can handle both simple and complex gestures at the same time.

Our main contribution is a neural network architecture for VR gesture recognition that is extremely flexible. It provides high recognition rates with varying numbers of gesture classes, input data formats and feature counts. This makes it suitable for rapid development of gesture-based user interfaces utilising modern motion controllers as well as legacy hardware.

REFERENCES

[1] D. Rubine: Specifying gestures by example, SIGGRAPH Comput. Graph., vol. 25, pp. 329-337, Jul. 1991
[2] Mingyu Chen, Ghassan AlRegib, Biing-Hwang Juang: A new 6D motion gesture database and the benchmark results of feature-based statistical recognition, Emerging Signal Processing Applications (ESPA), 2012 IEEE International Conference on, pp. 131-134, 2012
[3] Mingyu Chen, Ghassan AlRegib, Biing-Hwang Juang: 6D motion gesture recognition using spatio-temporal features, Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pp. 2341-2344, 2012, ISSN 1520-6149
[4] Mingyu Chen, Ghassan AlRegib, Biing-Hwang Juang: Feature Processing and Modeling for 6D Motion Gesture Recognition, Multimedia, IEEE Transactions on, vol. 15, pp. 561-571, 2013, ISSN 1520-9210
[5] Diederik P. Kingma, Jimmy Ba: Adam: A Method for Stochastic Optimization, 3rd International Conference for Learning Representations, San Diego, 2015
[6] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner: Gradient-based learning applied to document recognition, Proc. of the IEEE, November 1998
[7] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton: ImageNet Classification with Deep Convolutional Neural Networks, Advances in Neural Information Processing Systems 25 (NIPS 2012)
[8] Sepp Hochreiter, Jürgen Schmidhuber: Long Short-Term Memory, Neural Computation, vol. 9, issue 8, November 15, 1997, pp. 1735-1780
[9] Yarin Gal, Zoubin Ghahramani: A Theoretically Grounded Application of Dropout in Recurrent Neural Networks, 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain
[10] Michael Hoffman, Paul Varcholik, Joseph J. LaViola Jr.: Breaking the Status Quo: Improving 3D Gesture Recognition with Spatially Convenient Input Devices, Virtual Reality Conference (VR), 2010 IEEE
[11] TensorFlow, January 20, 2017

Roman Chomik: undergraduate at the Lodz University of Technology, Department of Technical Physics, Information Technology and Applied Mathematics, specializing in Game and Computer Simulations Technology. Currently continuing graduate studies at Lodz University of Technology with the speciality of Game Technology and Interactive Systems.

Jarosław Andrzejczak: Assistant Professor at the Institute of Information Technology, Faculty of Technical Physics, Information Technology and Applied Mathematics, Lodz University of Technology. In 2015 he received a PhD degree in computer science for work on interactive information visualization for search results in digital data sets. His research interests encompass usability testing and engineering, User Experience, data and information visualization, user interface design (including game interface design) and the application of information visualization in UI design.