
Real-time human action recognition based on depth motion maps

2013, Journal of Real-Time Image Processing (JRTIP)


J Real-Time Image Proc, DOI 10.1007/s11554-013-0370-1
ORIGINAL RESEARCH PAPER

Real-time human action recognition based on depth motion maps

Chen Chen, Kui Liu, Nasser Kehtarnavaz
The University of Texas at Dallas, Richardson, TX, USA
Corresponding author: C. Chen, e-mail: [email protected]

Received: 8 April 2013 / Accepted: 25 July 2013
© Springer-Verlag Berlin Heidelberg 2013

Abstract  This paper presents a human action recognition method based on depth motion maps (DMMs). Each depth frame in a depth video sequence is projected onto three orthogonal Cartesian planes. Under each projection view, the absolute difference between two consecutive projected maps is accumulated through an entire depth video sequence, forming a DMM. An l2-regularized collaborative representation classifier with a distance-weighted Tikhonov matrix is then employed for action recognition. The developed method is shown to be computationally efficient, allowing it to run in real time. The recognition results on the Microsoft Research Action3D dataset indicate superior performance of our method over the existing methods.

Keywords: Human action recognition · Depth motion map · RGBD camera · Collaborative representation classifier

1 Introduction

Human action recognition is an active research area in computer vision. Earlier attempts at action recognition involved video sequences captured by conventional video cameras, with spatio-temporal features widely used for recognizing human actions, e.g. [1–6]. As imaging technology advances, it has become possible to capture depth information in real time. Compared with conventional images, depth maps are insensitive to changes in lighting conditions and can provide 3D information toward distinguishing actions that are difficult to characterize using conventional images. Figure 1 shows two examples, each consisting of nine depth maps, of the action Golf swing and the action Forward kick. Since the release of low-cost depth sensors, in particular the Microsoft Kinect and ASUS Xtion, many research works have been carried out on human action recognition using depth imagery, e.g. [7–13]. As noted in [14], the 3D joint positions of a person's skeleton estimated from depth images provide additional information for achieving action recognition.

In this paper, the problem of human action recognition from depth map sequences, captured by an RGBD camera, is examined from the perspective of computational efficiency. Specifically, depth motion maps (DMMs) generated by accumulating the motion energy of projected depth maps in three projection views (front view, side view, and top view) are used as feature descriptors. Compared with 3D depth maps, DMMs are 2D images that provide an encoding of the motion characteristics of an action. Motivated by the success of sparse representation in face recognition [15–18] and image classification [18, 19], an l2-regularized collaborative representation classifier is utilized which seeks a match of an unknown sample via a linear combination of training samples from all the classes. The class label is then derived according to the class which best approximates the unknown sample. Basically, our introduced method involves a spatio-temporal motion representation based on DMMs followed by an l2-regularized collaborative representation classifier with a distance-weighted Tikhonov matrix to perform computationally efficient action recognition.

The rest of the paper is organized as follows. In Sect. 2, related works are presented.
In Sect. 3, the details of generating the DMM feature descriptors are stated. In Sect. 4, the sparse representation classifier (SRC) is first introduced and then the l2-regularized collaborative representation classifier is described for performing action recognition. The experimental results are reported in Sect. 5. Finally, concluding remarks are stated in Sect. 6.

Fig. 1 Examples of depth map sequences for (a) the Golf swing action and (b) the Forward kick action

2 Related works

Space–time based methods such as space–time volumes, spatio-temporal features, and trajectories have been widely utilized for human action recognition from video sequences captured by traditional RGB cameras. In [1], spatio-temporal interest points coupled with an SVM classifier were used to achieve human action recognition. Cuboid descriptors were employed in [2] for action representation. In [3], SIFT-feature trajectories modeled in a hierarchy of three abstraction levels were used to recognize actions in video sequences. Various local motion features were gathered as spatio-temporal bags-of-features (BoF) in [4] to perform action classification. Motion-energy images (MEI) and motion-history images (MHI) were introduced in [5] as motion templates to model the spatial and temporal characteristics of human actions in videos. In [6], a hierarchical extension for computing dense motion flow from MHI was presented. A major shortcoming of these intensity-based or color-based methods is the sensitivity of recognition to illumination variations, which limits the recognition robustness.

With the release of RGBD sensors, research into action recognition based on depth information has grown. Skeleton-based approaches utilize the locations of skeletal joints extracted from depth images. In [7], a view-invariant posture representation was devised using histograms of 3D joint locations (HOJ3D) within a modified spherical coordinate system. The HOJ3D were re-projected using LDA and clustered into k posture visual words, and the temporal evolutions of these visual words were modeled by a discrete hidden Markov model. In [8], a Naive-Bayes-Nearest-Neighbor (NBNN) classifier was employed to recognize human actions based on EigenJoints (i.e., position differences of joints) combining static posture, motion, and offset information. Such skeleton-based approaches have limitations due to inaccuracies in skeletal estimation. Moreover, the skeleton information is not always available in many applications.

There are also methods that extract spatio-temporal features from the entire set of points in a depth map sequence to distinguish different actions. An action graph was employed in [9] to model the dynamics of actions, and a collection of 3D points was used to characterize postures. However, the 3D point sampling scheme used generated a large amount of data, leading to a computationally expensive training step. In [10], a DMM-based histogram of oriented gradients (HOG) was utilized to compactly represent the body shape and movement information toward distinguishing actions. In [11], random occupancy pattern (ROP) features were extracted from depth images using a weighted sampling scheme. A sparse coding approach was utilized to robustly encode the ROP features for action recognition, and the features were shown to be robust to occlusion. In [12], 4D space–time occupancy patterns were used as features, which preserved spatial and temporal contextual information while coping with intra-class variations.
A simple classifier based on the cosine distance was then used for action recognition. In [13], a hybrid solution combining skeleton and depth information was used for action recognition. 3D joint positions and local occupancy patterns were used as features, and an actionlet ensemble model was then learnt to represent each action and to capture intra-class variations.

In general, the above references do not elaborate on the computational complexity aspect of their solutions and do not provide actual real-time processing times. In contrast to the existing methods, in this work both the computational complexity and the processing times associated with each component of our method are reported.

3 Depth motion maps as features

A depth map can be used to capture the 3D structure and shape information. Yang et al. [10] proposed to project depth frames onto three orthogonal Cartesian planes for the purpose of characterizing the motion of an action. Due to its computational simplicity, the same approach as in [10] is adopted in this work, while the procedure to obtain DMMs is modified. More specifically, each 3D depth frame is used to generate three 2D projected maps corresponding to the front, side, and top views, denoted by map_v where v ∈ {f, s, t}. For a point (x, y, z) in a depth frame, with z denoting the depth value in a right-handed coordinate system, the pixel value in the three projected maps is indicated by z, x, and y, respectively. Different from [10], for each projected map the motion energy is calculated here as the absolute difference between two consecutive maps without thresholding. For a depth video sequence with N frames, DMM_v is obtained by stacking the motion energy across an entire depth video sequence as follows:

DMM_v = \sum_{i=a}^{b} \left| \mathrm{map}_v^{\,i} - \mathrm{map}_v^{\,i-1} \right|,   (1)

where i represents the frame index, map_v^i is the projected map of the ith frame under projection view v, and a ∈ {2, ..., N} and b ∈ {2, ..., N} denote the starting and ending frame indices. It should be noted that not all the frames in a depth video sequence are used to generate DMMs; this point is discussed further in the experimental setup section. A bounding box is then set to extract the non-zero region as the foreground in each DMM. Let the foreground-extracted DMM be denoted by DMM_v hereafter. Two examples of DMM_v generated from the Tennis serve and Forward kick video sequences are shown in Fig. 2. DMMs from the three projection views effectively capture the characteristics of the motion in a distinguishable way. That is the reason here for using DMMs as feature descriptors for action recognition.

Since the DMM_v of different action video sequences may have different sizes, bicubic interpolation is used to resize all DMM_v under the same projection view to a fixed size in order to reduce the intra-class variability, for example due to different subject heights. The size of DMM_f is m_f × n_f, the size of DMM_s is m_s × n_s, and the size of DMM_t is m_t × n_t. Since pixel values are used as features, they are normalized between 0 and 1 to avoid large pixel values dominating the feature set. The resized and normalized DMM is denoted by DMM_v. For an action video sequence with three DMMs, a feature vector of size (m_f × n_f + m_s × n_s + m_t × n_t) × 1 is thus formed by concatenating the three vectorized DMMs,

h = \left[ \mathrm{vec}(DMM_f)^T, \mathrm{vec}(DMM_s)^T, \mathrm{vec}(DMM_t)^T \right]^T,

where vec(·) indicates the vectorization operator and T the matrix transpose. The feature vector encodes the 4D characteristics of an action video sequence.
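To make this feature construction concrete, the following is a minimal Python/NumPy sketch of the DMM accumulation of Eq. (1) followed by the bounding-box cropping, resizing, normalization, and concatenation steps. It is an illustrative reading of the text, not the authors' code: the depth quantization for the side and top projections, the handling of several points mapping to the same projected pixel, and the target DMM sizes are placeholder assumptions not specified in the paper.

```python
import numpy as np
from scipy.ndimage import zoom

def project_views(depth, depth_bins=256, z_max=4000.0):
    """Project one depth frame onto the front (x-y), side (y-z) and top (x-z) planes.
    Pixel values are z, x and y respectively (Sect. 3); z is quantized into
    depth_bins bins and collisions are resolved with a max (illustrative choices)."""
    H, W = depth.shape
    front = depth.astype(np.float64)
    side = np.zeros((H, depth_bins))
    top = np.zeros((depth_bins, W))
    ys, xs = np.nonzero(depth)
    zb = np.clip((depth[ys, xs] / z_max * (depth_bins - 1)).astype(int), 0, depth_bins - 1)
    np.maximum.at(side, (ys, zb), xs)
    np.maximum.at(top, (zb, xs), ys)
    return {"f": front, "s": side, "t": top}

def compute_dmms(frames):
    """Accumulate |map_v^i - map_v^{i-1}| over consecutive frames (Eq. 1).
    Frame selection (e.g. dropping the first/last five frames, Sect. 5) is left to the caller."""
    maps = [project_views(f) for f in frames]
    return {v: sum(np.abs(maps[i][v] - maps[i - 1][v]) for i in range(1, len(maps)))
            for v in ("f", "s", "t")}

def dmm_feature(dmms, sizes=None):
    """Crop each DMM to its non-zero bounding box, resize to a fixed size,
    normalize to [0, 1], and concatenate the three views into one vector h."""
    sizes = sizes or {"f": (50, 25), "s": (50, 25), "t": (25, 25)}   # placeholder sizes
    parts = []
    for v, (m, n) in sizes.items():
        d = dmms[v]
        rows, cols = np.nonzero(d)
        d = d[rows.min():rows.max() + 1, cols.min():cols.max() + 1]  # bounding box
        d = zoom(d, (m / d.shape[0], n / d.shape[1]), order=3)       # ~bicubic resize
        d = (d - d.min()) / (d.max() - d.min() + 1e-12)              # normalize to [0, 1]
        parts.append(d.ravel())
    return np.concatenate(parts)                                     # feature vector h
```

In practice, the frame removal and the fixed DMM sizes would be set as described in the experimental setup of Sect. 5.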
Note that the HOG descriptors of the DMMs are not computed here as done in [10], and image resizing is applied to the DMMs but not to each projected depth map as done in [10]. As a result, the computational complexity of the feature extraction process is greatly reduced.

Fig. 2 DMM_v generated from (a) Tennis serve and (b) Forward kick depth action video sequences

4 l2-regularized collaborative representation classifier

Sparse representation (or sparse coding) has been an active research area in the machine learning community due to its success in face recognition [15–18] and image classification [18, 19]. The central idea of the SRC is to represent a test sample according to a small number of atoms sparsely chosen out of an over-complete dictionary formed by all the available training samples. Consider a dataset with C classes of training samples arranged column-wise as A = [A_1, A_2, ..., A_C] ∈ R^{d×n}, where A_j (j = 1, ..., C) is the subset of the training samples associated with class j, d is the dimension of the training samples, and n is the total number of training samples from all the classes. A test sample g ∈ R^d can be represented as a sparse linear combination of the training samples, which can be formulated as

g = A\alpha,   (2)

where α = [α_1; α_2; ...; α_C] is an n × 1 vector of coefficients corresponding to all the training samples and α_j (j = 1, ..., C) denotes the subset of the coefficients associated with the training samples from the jth class, i.e. A_j. From a practical standpoint, one cannot directly solve for α since (2) is typically under-determined [17]. To reach a solution, one can solve the following l1-norm minimization problem,

\hat{\alpha} = \arg\min_{\alpha} \left\{ \|g - A\alpha\|_2^2 + \theta \|\alpha\|_1 \right\},   (3)

where θ is a scalar regularization parameter which balances the influence of the residual and the sparsity term. The class label of g is then obtained via

\mathrm{class}(g) = \arg\min_{j} e_j,   (4)

where e_j = \|g - A_j \hat{\alpha}_j\|_2. The reader is referred to [15] for more details.

As described in [20], it is the collaborative representation, i.e. the use of all the training samples as a dictionary, and not the l1-norm sparsity constraint, that improves the classification accuracy. The l2-regularized approach generates comparable results but with significantly lower computational complexity [20, 21]. Therefore, the l2-regularized approach is used here for action recognition. As mentioned in Sect. 3, each depth video sequence generates a feature vector h ∈ R^{m_f n_f + m_s n_s + m_t n_t}; therefore, the dictionary is A = [h_1, h_2, ..., h_K], with K being the total number of available training samples from all the action classes. Let y_q ∈ R^{m_f n_f + m_s n_s + m_t n_t} denote the feature vector of an unknown action sample. Tikhonov regularization [22] is employed here to calculate the coefficient vector according to

\hat{\alpha} = \arg\min_{\alpha} \left\{ \|y_q - A\alpha\|_2^2 + \lambda \|L\alpha\|_2^2 \right\},   (5)

where L is the Tikhonov regularization matrix and λ is the regularization parameter. The term L allows the imposition of prior knowledge on the solution. Normally, L is chosen to be the identity matrix. The approach proposed in [23] is adopted here by giving less weight to the training samples that are dissimilar from the unknown sample than to those that are similar. Specifically, a diagonal matrix L of the following form is considered:

L = \mathrm{diag}\left( \|y_q - h_1\|_2, \ldots, \|y_q - h_K\|_2 \right).   (6)

The coefficient vector is then calculated in closed form as follows [24]:

\hat{\alpha} = \left( A^T A + \lambda L^T L \right)^{-1} A^T y_q.   (7)

The class label for each unknown sample is then found from (4).
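Because (7) is a closed-form solution, the classifier is straightforward to implement. The following is a minimal Python/NumPy sketch of Eqs. (4)–(7), not the authors' implementation (and not their Algorithm 1): it assumes the training feature vectors have already been computed, optionally PCA-reduced, and stacked column-wise into the dictionary A, with the regularization value supplied by the caller.

```python
import numpy as np

def crc_l2_classify(A, labels, y, lam=0.001):
    """l2-regularized collaborative representation classification of a test
    feature vector y against a dictionary A (d x K) of training feature vectors.
    labels is a length-K array holding the class of each column of A."""
    # Distance-weighted Tikhonov matrix L of Eq. (6): a diagonal matrix whose
    # entries are the Euclidean distances between y and each training sample.
    dists = np.linalg.norm(A - y[:, None], axis=0)        # ||y - h_k||_2 for each column
    LtL = np.diag(dists ** 2)                             # L^T L with L = diag(dists)

    # Closed-form coefficients of Eq. (7): alpha = (A^T A + lam L^T L)^{-1} A^T y
    alpha = np.linalg.solve(A.T @ A + lam * LtL, A.T @ y)

    # Class-wise reconstruction errors of Eq. (4): e_j = ||y - A_j alpha_j||_2
    errors = {}
    for c in np.unique(labels):
        idx = (labels == c)
        errors[c] = np.linalg.norm(y - A[:, idx] @ alpha[idx])
    label = min(errors, key=errors.get)
    return label, errors[label]                           # predicted class and e_min
```

Returning the minimum reconstruction error alongside the predicted label is convenient because the same quantity drives the rejection option described later in Sect. 5.2.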
Algorithm 1 provides more details of the l2-regularized collaborative representation classifier utilized.

5 Experimental setup

In this section, it is explained how our method was applied to the public-domain Microsoft Research (MSR) Action3D dataset [9], with the depth map sequences captured by an RGBD camera. Our method is then compared with the existing methods. The MSR-Action3D dataset includes 20 actions performed by 10 subjects. Each subject performed each action 2 or 3 times, and each subject performed the same action differently; as a result, the dataset incorporates intra-class variation. For example, the speed of performing an action varied across subjects. The resolution of each depth map was 320 × 240.

To facilitate a fair comparison, the same experimental settings as in [7–10, 12] were considered. The actions were divided into three subsets as listed in Table 1. For each action subset, three different tests were performed. In Test One, 1/3 of the samples were used as training samples and the rest as test samples; in Test Two, 2/3 of the samples were used as training samples and the rest as test samples; in the Cross Subject Test (or Test Three), half of the subjects were used for training and the rest as test subjects. In the experimental setup reported in [9], in Test One (or Two), for each action and each subject, the first (or first two) action sequences were used for training, while in the Cross Subject Test, subjects 1, 3, 5, 7, and 9 (if they existed) were used for training. Noting that the samples or subjects used for training and testing were fixed, these are referred to as Fixed Tests here. Another experiment was conducted by randomly choosing the training samples or training subjects corresponding to the three tests. In other words, the action sequences of each subject for each action were randomly chosen to serve as training samples in Test One and Test Two, and for the Cross Subject Test, half of the subjects were randomly chosen for training and the rest used for testing. These tests are referred to as Random Tests here.

For each depth video sequence, the first five frames and the last five frames were removed and the remaining frames were used to generate DMM_v. The purpose of this frame removal was two-fold. First, at the beginning and the end, the subjects were mostly in a stand-still position with only small body movements, which did not contribute to the motion characteristics of the video sequences. Second, in our process of generating DMMs, small movements at the beginning and the end resulted in a stand-still body shape with large pixel values along the edges, which contributed a large amount of reconstruction error. Therefore, the initial and end frames were removed to eliminate this no-motion condition. Other frame selection methods may be used here to achieve the same outcome.

Table 1 Three subsets of actions used for the MSR-Action3D dataset
Action set 1 (AS1): Horizontal wave (2), Hammer (3), Forward punch (5), High throw (6), Hand clap (10), Bend (13), Tennis serve (18), Pickup throw (20)
Action set 2 (AS2): High wave (1), Hand catch (4), Draw x (7), Draw tick (8), Draw circle (9), Two hand wave (11), Forward kick (14), Side boxing (12)
Action set 3 (AS3): High throw (6), Forward kick (14), Side kick (15), Jogging (16), Tennis swing (17), Tennis serve (18), Golf swing (19), Pickup throw (20)

To have three fixed sizes for DMM_v, the sizes of the DMMs of all the samples (training and test samples) were found under each projection view.
The fixed size of each DMM was simply set to half of the mean value of all the sizes. For the training feature set and the test feature set, principal component analysis (PCA) was applied to reduce the dimensionality. The PCA transform matrix was calculated using the training feature set and then applied to the test feature set. This dimensionality reduction step provided computational efficiency for the classification. In our experiments, the largest 85 % of the eigenvalues were kept.

5.1 Parameter selection

In the l2-regularized collaborative representation classifier, a key parameter is λ, which controls the relative effect of the Tikhonov regularization term in the optimization stated in (5). Many approaches have been presented in the literature for finding an optimal value of this regularization parameter, such as the L-curve [25], the discrepancy principle, and generalized cross-validation (GCV). To find an optimal λ, a set of values was examined. Figure 3 shows the recognition rates for different values of λ in the Fixed Cross Subject Test. The Random Cross Subject Test was also performed with the same set of values; for each value of λ, the testing was repeated 50 times, and the average recognition rates are shown in Fig. 4. From Figs. 3 and 4, one can see that the recognition accuracy was quite stable over a large range of λ values. As a result, the value λ = 0.001 was chosen for all the experiments reported here.

Fig. 3 Recognition rates of the Fixed Cross Subject Test for various values of λ
Fig. 4 Average recognition rates of the Random Cross Subject Test for various values of λ

5.2 Rejection option

An option was added to reject an action which did not belong to an action set. For example, since the action Jump was not included in the MSR-Action3D dataset, it was rejected. This was done by setting a rejection threshold for the minimum reconstruction error calculated from (4). This threshold was set according to the degree of similarity of the action not included in the recognition set. Let e_min indicate the minimum reconstruction error; the decision of rejecting or accepting an unknown action sample was then made as follows (a minimal code sketch of this rejection rule is given below):

\mathrm{Decision(action)} =
\begin{cases}
\text{Reject}, & \text{if } e_{\min} > \text{threshold} \\
\text{Accept}, & \text{otherwise}
\end{cases}   (8)

To find an appropriate rejection threshold, Random Tests on the MSR-Action3D dataset were done by repeating each test for each subset 200 times. Noting K_T to be the total number of test samples in a subset for a random test, the 200 test trials generated 200 × K_T minimum reconstruction errors, forming a vector E = [e_min^1, e_min^2, ..., e_min^{200 K_T}]. The mean of E was then calculated and used as the rejection threshold.

5.3 Results and discussion

5.3.1 Recognition results

Our method was compared with the existing methods using the MSR-Action3D dataset. The comparison results are reported in Table 2. The best recognition rate achieved is highlighted in bold. From Table 2, it can be seen that our method outperformed the method reported in [9] in all the test cases. For the challenging Cross Subject Test, our method produced a 90.5 % recognition rate, which was slightly lower than the method reported in [10]. However, it should be noted that our method did not require the calculation of HOG descriptors and thus was computationally much more efficient. The confusion matrix of our method for the Fixed Cross Subject Test is shown in Fig. 5; for a compact representation, numbers are used to indicate the actions listed in Table 1.
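As referenced in Sect. 5.2 above, the rejection rule of Eq. (8) can be sketched as follows. This is an illustrative sketch, not the authors' code: it reuses the hypothetical crc_l2_classify function from the earlier sketch and treats the threshold as an input estimated from the pooled minimum reconstruction errors of the random test trials.

```python
import numpy as np

def estimate_rejection_threshold(min_errors):
    """Mean of the pooled minimum reconstruction errors (the vector E of Sect. 5.2)."""
    return float(np.mean(min_errors))

def classify_with_rejection(A, labels, y, threshold, lam=0.001):
    """Eq. (8): reject the sample if its minimum reconstruction error exceeds
    the threshold, otherwise accept the predicted class."""
    label, e_min = crc_l2_classify(A, labels, y, lam)   # from the earlier sketch
    if e_min > threshold:
        return None, e_min                              # Reject
    return label, e_min                                 # Accept
```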
There are three possible reasons for the misclassifications in the Cross Subject Test. First, large intra-class variations existed due to considerable differences in the same action performed by different subjects. Although the DMM_v of all the samples were normalized to have the same sizes, the normalization could not eliminate the intra-class variations entirely. Second, the feature formed by the DMM_v did not exhibit enough discriminatory power to distinguish similar motions. For example, Hammer was confused with Forward punch, and High throw was confused with Tennis serve, since they had similar motion characteristics; in other words, the DMM_v generated by these actions were similar. Finally, since our classification decision was based on the reconstruction errors of the different training classes in (4), the class with the smallest reconstruction error was favored. Hence, a misclassification occurred when two actions were similar and the wrong class had a smaller reconstruction error.

To verify that our method did not depend on specific training data, another experiment was done by randomly choosing training samples or training subjects for the three tests. Each test was run for each subset 200 times and the mean performance (mean accuracy ± standard deviation) was computed; see Table 3. For Test One and Test Two, the average recognition rates over the subsets were comparable with the outcomes shown in Table 2. In the Cross Subject Test, the average recognition rate dropped by about 10 %, mainly due to large intra-class variation. However, our method still achieved an 80 % recognition rate overall, which was higher than the rates reported in [7] and [9] using the Fixed Cross Subject Test.

Furthermore, the l1-regularized SRC (denoted by L1) and SVM [27] were considered in order to compare their recognition performance with our l2-regularized collaborative representation classifier (denoted by L2). These three classifiers were tested on the same training and test samples of the Random Cross Subject Test with 200 trials. The SPAMS toolbox [26] was employed to solve the optimization problem in (3) due to its fast implementation. A radial basis function (RBF) kernel was used for the SVM, and its two parameters (penalty parameter and kernel width) were tuned for optimal recognition rates. The average recognition rates using the three classifiers are shown in Fig. 6. As exhibited in this figure, our l2-regularized collaborative representation classifier was on par with the SRC and consistently outperformed the SVM classifier in all three subsets. A disadvantage of the SVM was also the requirement to tune its two parameters.

Table 2 Recognition rates (%) comparison of Fixed Tests for the MSR-Action3D dataset

                       Li et al. [9]   Lu et al. [7]   Yang et al. [8]   Yang et al. [10]   Vieira et al. [12]   Our method
Test One     AS1       89.5            98.5            94.7              97.3               98.2                 97.3
             AS2       89.0            96.7            95.4              92.2               94.8                 96.1
             AS3       96.3            93.5            97.3              98.0               97.4                 98.7
             Average   91.6            96.2            95.8              95.8               96.8                 97.4
Test Two     AS1       93.4            98.6            97.3              98.7               99.1                 98.6
             AS2       92.9            97.2            98.7              94.7               97.0                 98.7
             AS3       96.3            94.9            97.3              98.7               98.7                 100
             Average   94.2            97.2            97.8              97.4               98.3                 99.1
Cross        AS1       72.9            88.0            74.5              96.2               84.7                 96.2
Subject      AS2       71.9            85.5            76.1              84.1               81.3                 83.2
Test         AS3       79.2            63.6            96.4              94.6               88.4                 92.0
             Average   74.7            79.0            82.3              91.6               84.8                 90.5

Fig. 5 Confusion matrices of our method for the Fixed Cross Subject Test: (a) subset AS1, (b) subset AS2, (c) subset AS3
Table 3 Average and standard deviation of recognition rates (%) of our method for the MSR-Action3D dataset in Random Tests

           Test One      Test Two      Cross Subject Test
AS1        97.4 ± 0.9    98.5 ± 1.1    84.8 ± 4.4
AS2        96.1 ± 1.5    97.8 ± 1.4    67.8 ± 4.3
AS3        97.7 ± 1.2    98.9 ± 1.1    87.1 ± 3.7
Average    97.1 ± 1.2    98.4 ± 1.2    79.9 ± 4.1

Fig. 6 Comparison of recognition rates (%) using different classifiers in the Random Cross Subject Test

5.3.2 Real-time operation

There are four main components in our method: (1) projected depth map generation (three views) for each depth frame, (2) DMM feature generation, (3) dimensionality reduction (PCA), and (4) action recognition (l2-regularized collaborative representation classifier). Our real-time action recognition timeline is displayed in Fig. 7, in which the numbers indicate these main components. The generation of the projected maps and DMMs is executed right after each depth frame is captured, while the dimensionality reduction and action recognition are performed after an action sequence gets completed. Since the PCA transform matrix is calculated using the training feature set, it can be directly applied to the feature vector of a test sample. Our code is written in Matlab, and the processing times reported are for a PC with a 2.67 GHz Intel Core i7 CPU and 4 GB RAM. The average processing time of each component is listed in Table 4. Note that the average number of depth frames in an action video sequence (after frame removal) is about 30.

Fig. 7 Real-time action recognition timeline

Table 4 Average and standard deviation of the processing time of the components of our method

Component                                  Processing time (ms)
1  Projected depth map generation          2.0 ± 0.4 per frame
2  DMM feature generation                  3.3 ± 0.6 per frame
3  Dimensionality reduction (PCA)          2.5 ± 1.2 per action sequence
4  Action recognition                      1.8 ± 0.5 per action sequence

The computational complexity of the major components involved in the different methods is provided in Table 5.

Table 5 Computational complexity and speed-up performance

Method               Computational complexity of major components       Approximate speedup for a typical set of parameters
Li et al. [9]        O(J × K_h × D²)                                     1
Lu et al. [7]        O(K_h M P + P³) + O(N_h H²)                         3
Yang et al. [8]      O(m³ + m²r) + O(r × n_c × n_d × log(n_c × n_d))     16
Yang et al. [10]     O(r³)                                               9
Vieira et al. [12]   O(m³ + m²r) + O(n_c × r²)                           20
Our method           O(m³ + m²r) + O(n_c × r)                            24

In [9], bi-gram maximum likelihood decoding (BMLD) for a Gaussian mixture model (GMM) was adopted to mitigate the computational complexity, with a complexity of O(J × K_h × D²) [28], where J denotes the number of iterations, K_h the number of samples in the dataset, and D the dimensionality of the state. As reported in [7], the complexity is mostly due to Fisher's linear discriminant analysis (LDA) and the HMM; the computation of the voting of joints into the bins is relatively trivial. The computational complexity of LDA is O(K_h M P + P³), where M is the number of extracted features and P = min(K_h, M), and the computational complexity of the HMM is O(N_h H²) [29], where N_h denotes the total number of states and H the length of the observation sequence. In [8], the computational complexities of PCA [30] and the Naive-Bayes-Nearest-Neighbor (NBNN) classifier are stated as O(m³ + m²r) and O(r × n_c × n_d × log(n_c × n_d)), respectively, where m denotes the dimension of a sample vector, r the number of training samples, n_c the number of classes, and n_d the number of descriptors. In [10], the computational complexity of the SVM is stated as O(r³) [31]. In [12], the computational complexities of PCA and the classifier are stated as O(m³ + m²r) and O(n_c × r²), respectively.
Table 5 also provides the speedup for a typical set of parameters: J = 50, K_h = 200, D = 30, N_h = 6, H = 27, M = 125, r = 100, m = 50, n_c = 8, n_d = 40. As can be seen from this table, our method is the most computationally efficient one.

6 Conclusion

In this paper, a computationally efficient DMM-based human action recognition method using an l2-regularized collaborative representation classifier was introduced. The DMMs generated from the three projection views were used to capture the motion characteristics of an action sequence. An average recognition rate of 90.5 % on the MSR-Action3D dataset was achieved, outperforming the existing methods. In addition, the utilization of the l2-regularized collaborative representation classifier was shown to be computationally efficient, leading to a real-time implementation.

References

1. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: Proceedings of the IEEE International Conference on Pattern Recognition, vol. 3, pp. 32–36, Cambridge, UK (2004)
2. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal features. In: IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72, Beijing, China (2005)
3. Sun, J., Wu, X., Yan, S., Cheong, L.F., Chua, T., Li, J.: Hierarchical spatio-temporal context modeling for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2004–2011, Miami, FL (2009)
4. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8, Anchorage, AK (2008)
5. Bobick, A., Davis, J.: The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(3), 257–267 (2001)
6. Davis, J.: Hierarchical motion history images for recognizing human motion. In: Proceedings of the IEEE Workshop on Detection and Recognition of Events in Video, pp. 39–46, Vancouver, BC (2001)
7. Xia, L., Chen, C., Aggarwal, J.K.: View invariant human action recognition using histograms of 3D joints. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 20–27, Providence, RI (2012)
8. Yang, X., Tian, Y.: EigenJoints-based action recognition using Naive-Bayes-Nearest-Neighbor. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 14–19, Providence, RI (2012)
9. Li, W., Zhang, Z., Liu, Z.: Action recognition based on a bag of 3D points. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 9–14, San Francisco, CA (2010)
10. Yang, X., Zhang, C., Tian, Y.: Recognizing actions using depth motion maps-based histograms of oriented gradients. In: Proceedings of the ACM International Conference on Multimedia, pp. 1057–1060, Nara, Japan (2012)
11. Wang, J., Liu, Z., Chorowski, J., Chen, Z., Wu, Y.: Robust 3D action recognition with random occupancy patterns. In: Proceedings of the European Conference on Computer Vision, pp. 872–885, Florence, Italy (2012)
12. Vieira, A., Nascimento, E., Oliveira, G., Liu, Z., Campos, M.: STOP: space-time occupancy patterns for 3D action recognition from depth map sequences. In: Iberoamerican Congress on Pattern Recognition, pp. 252–259, Buenos Aires, Argentina (2012)
13. Wang, J., Liu, Z., Wu, Y., Yuan, J.: Mining actionlet ensemble for action recognition with depth cameras. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1290–1297, Providence, RI (2012)
14. Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., Blake, A.: Real-time human pose recognition in parts from single depth images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1297–1304, Colorado Springs, CO (2011)
15. Wright, J., Yang, A., Ganesh, A., Sastry, S., Ma, Y.: Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(2), 210–227 (2009)
16. Wright, J., Ma, Y.: Dense error correction via l1 minimization. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3033–3036, Taipei, Taiwan (2009)
17. Wright, J., Ma, Y., Mairal, J., Sapiro, G., Huang, T., Yan, S.: Sparse representation for computer vision and pattern recognition. Proceedings of the IEEE 98(6), 1031–1044 (2010)
18. Gao, S., Tsang, I.W.-H., Chia, L.: Kernel sparse representation for image classification and face recognition. In: Proceedings of the European Conference on Computer Vision, pp. 1–14, Crete, Greece (2010)
19. Yang, J., Yu, K., Gong, Y., Huang, T.: Linear spatial pyramid matching using sparse coding for image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1794–1801, Miami, FL (2009)
20. Zhang, L., Yang, M., Feng, X.: Sparse representation or collaborative representation: which helps face recognition? In: Proceedings of the IEEE International Conference on Computer Vision, pp. 471–478, Barcelona, Spain (2011)
21. Shi, Q., Eriksson, A., Hengel, A., Shen, C.: Is face recognition really a compressive sensing problem? In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 553–560, Colorado Springs, CO (2011)
22. Tikhonov, A., Arsenin, V.: Solutions of Ill-Posed Problems. V. H. Winston & Sons, Washington, DC (1977)
23. Chen, C., Tramel, E., Fowler, J.: Compressed-sensing recovery of images and video using multihypothesis predictions. In: Proceedings of the Asilomar Conference on Signals, Systems, and Computers, pp. 1193–1198, Pacific Grove, CA (2011)
24. Golub, G., Hansen, P.C., O'Leary, D.: Tikhonov regularization and total least squares. SIAM J. Matrix Anal. Appl. 21(1), 185–194 (1999)
25. Hansen, P., O'Leary, D.: The use of the L-curve in the regularization of discrete ill-posed problems. SIAM J. Sci. Comput. 14(6), 1487–1503 (1993)
26. Mairal, J.: SPAMS (SPArse Modeling Software). spams-devel.gforge.inria.fr
27. Chang, C., Lin, C.: LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2(3), 27:1–27:27 (2011). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/
28. Li, W., Zhang, Z., Liu, Z.: Expandable data-driven graphical modeling of human actions based on salient postures. IEEE Transactions on Circuits and Systems for Video Technology 18(11), 1499–1510 (2008)
29. Fine, S., Singer, Y., Tishby, N.: The hierarchical hidden Markov model: analysis and applications. Mach. Learn. 32(1), 41–62 (1998)
30. Liu, K., Ma, B., Du, Q., Chen, G.: Fast motion detection from airborne videos using graphics computing units. J. Appl. Remote Sens. 6(1) (2012)
31. Tsang, I., Kwok, J., Cheung, P.-M.: Core vector machines: fast SVM training on very large data sets. J. Mach. Learn. Res. 6, 363–392 (2005)
Author Biographies

Chen Chen received the B.E. degree in automation from Beijing Forestry University, Beijing, China, in 2009 and the M.S. degree in electrical engineering from Mississippi State University, Starkville, MS, in 2012. He is currently working toward the Ph.D. degree in the Department of Electrical Engineering, University of Texas at Dallas, Richardson, TX. His research interests include compressed sensing, signal and image processing, pattern recognition, and computer vision.

Kui Liu received the B.S. degree in electrical engineering from Nanchang University, Nanchang, China, in 2005, and the M.S. degree in electrical engineering from Mississippi State University, Starkville, MS, in 2011. He is currently a graduate research assistant in the Department of Electrical Engineering at the University of Texas at Dallas as a member of the Signal and Image Processing Laboratory. His research interests include real-time image processing, 3D computer vision, and machine learning.

Nasser Kehtarnavaz received the Ph.D. degree in electrical and computer engineering from Rice University in 1987. He is a Professor of Electrical Engineering and Director of the Signal and Image Processing Laboratory at the University of Texas at Dallas. His research areas include signal and image processing, real-time signal and image processing, biomedical image analysis, and pattern recognition. He has authored or co-authored 8 books and more than 200 papers in these areas. He is currently Chair of the Dallas Chapter of the IEEE Signal Processing Society, Co-chair of the SPIE Conference on Real-Time Image and Video Processing, and Co-editor-in-Chief of the Journal of Real-Time Image Processing. He is a fellow of IEEE and SPIE.