
High Performance Set of Features for Human Action Classification

S. Brahnam¹ and L. Nanni²
¹ Computer Information Systems, Missouri State University, Springfield, MO, USA
² DEIS, IEIIT—CNR, Università di Bologna, Bologna, Italy

Abstract - The most common method for handling human action classification is to determine a common set of optimal features and then apply a machine-learning algorithm to classify them. In this paper we explore combining sets of different features for training an ensemble using random subspace with a set of support vector machines. We propose two novel descriptors for this task domain: one based on Gabor filters and the other based on local binary patterns (LBPs). We then combine these two sets of features with the histogram of gradients. We obtain an accuracy of 97.8% using the 10-class Weizmann dataset and a 100% accuracy rate using the 9-class Weizmann dataset. These results are comparable with the state of the art. By combining sets of relatively simple descriptors it is possible to obtain results comparable to using more sophisticated approaches. Our simpler approach, however, offers the advantage of being less computationally expensive.

Keywords: Human Action Classification; Local Binary Patterns; Gabor Filters; Histogram of Gradients; Machine Learning Techniques; Ensemble of Support Vector Machines.

1 Introduction

Human action classification can be defined as the task of matching videos containing human motion to a set of action class labels. The field has only recently, within the last few years, been seriously investigated. Automatic labeling of actions in video sequences has value in a variety of video-search applications, such as locating sports plays and dance moves in sports and music videos and spotting suspicious behaviors (such as running out of a bank) in surveillance video [1]. There are also a number of artistic, gaming, and human-computer communication applications that could benefit from matching human motion to a set of action labels. For several general surveys of human action analysis that discuss the importance of this problem, see [2-6].

Automated human action classification is a difficult machine classification problem. Challenges include large variations in action performance produced by differences in people's anatomy, differences in recording setups and environmental conditions (including lighting, camera viewpoint, and background complexity), and spatial and temporal variations (including variations in the rate at which people perform actions [7]).

Significant research in human action analysis includes Blank et al. [8], who used silhouettes to construct a space-time volume; in that study, properties of the solution to the Poisson equation were utilized for activity recognition. In Kellokumpu et al. [9], a texture descriptor is used to characterize motion history images, showing that a collection of local features can form a very robust description of human movement. Boiman and Irani [10] propose a new notion of similarity between signals: the regions of the "query" signal that can be composed using large contiguous chunks of data from the "reference" signal are considered to have high local similarity. Finally, in Ikizler and Duygulu [11], a human pose is divided into rectangular patches and represented by a histogram of the extracted patches' orientations.
A number of datasets have been developed specifically to evaluate the systems reported in these studies: the KTH human motion dataset [12], which includes sequences of 25 actors performing 6 actions; the INRIA XMAS multi-view dataset [13], which contains 14 actions from 11 subjects captured from 5 viewpoints; the UCF sports dataset [14]; the Hollywood human action dataset [15], a set of 8 actions performed by a variety of actors; and the Weizmann human action dataset [8], which contains recorded sequences of 10 actions performed by 9 actors. The Weizmann dataset, used in the studies reported in this paper, has become a widely used benchmark for comparing action classification systems.

In this paper, we show that human action classification is best handled by combining multiple descriptors to boost performance. We combine three sets of features to obtain a reliable method for human action classification. In particular, we show that the response of Gabor filters and the standard application of local binary patterns to the mask images available in the Weizmann dataset each obtain accuracies above 90%. The complete system, based on combining the features proposed in this paper with the histogram of gradients, obtains an accuracy of 97.8% on the 10-class Weizmann dataset and 100% on the 9-class Weizmann dataset. Our experiments highlight the fact that fusing simple feature extractors can obtain results comparable to the state of the art [9-11, 16]. We also show that local binary patterns (LBP) can be applied directly to the masks. Thus far only a variant of the LBP, local binary patterns from three orthogonal planes (LBP-TOP), has been tested for this problem [16]. Even though LBP-TOP has obtained better results than our simple LBP, more frames must be considered in the feature extraction process using LBP-TOP, thereby increasing computational complexity. Because of this, LBP-TOP, unlike our method, could not function in a real-time system.

The remainder of this paper is organized as follows. In section 2 we briefly describe our system architecture. In section 3 we provide a detailed description of the descriptors used in our experiments. In section 4 we provide experimental results. Finally, in section 5 we conclude by noting our contributions and offering a few suggestions for future research.

2 System architecture

In Figure 1 we provide a schematic of our complete system.

[Figure 1. Proposed fusion system using the new descriptors.]

We use the masked images available in the Weizmann dataset [8, 17] (available at http://www.wisdom.weizmann.ac.il/~vision/SpaceTimeActions.html). As mentioned above, this dataset has become a popular benchmark in this task domain. There are two different versions: the standard 10-class dataset and a reduced version that does not include the Skipping class (the 9-class Weizmann dataset). The ten actions, illustrated in Figure 2, are performed by 9 subjects. In our system we resized the masks to 150 × 105 pixels. For more information on the dataset and masks, see [8, 17].

[Figure 2. Some samples of the 10 action classes in the Weizmann dataset.]

In the experiments reported in this paper, we combine the following descriptors: Gabor filters, invariant local binary patterns, and the histogram of oriented gradients. The descriptors are extracted from mask images obtained with a background subtraction algorithm [18] (for the source code we used, see http://maven.smith.edu/~nhowe/research/code/).

The feature vector that describes a given sequence is obtained simply by summing the features extracted from each stand-alone frame. Since two of the descriptors we use are novel for this task, we provide a detailed description of feature extraction in section 3. In the classification step, a random subspace (RS) ensemble of support vector machines [19] is used. In the RS method, each classifier is trained with a subset of all available features. We combine 50 linear support vector machines, each trained with 50% of the available features. A separate random subspace ensemble is trained for each set of features, and the three resulting systems are combined using the sum rule, as sketched below.
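The following Python sketch illustrates this classification stage under stated assumptions: scikit-learn's BaggingClassifier with bootstrap=False and max_features=0.5 is one standard way of implementing the random subspace method (not the authors' code), the z-score normalization before the sum rule is our assumption since the paper does not specify how scores are scaled, and the descriptor matrices X_ac, X_lbp, X_hog and labels y are hypothetical placeholders.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import LinearSVC

def random_subspace_svm(n_estimators=50, feature_fraction=0.5):
    """Random subspace ensemble: each linear SVM sees a random 50% of features."""
    return BaggingClassifier(
        estimator=LinearSVC(),        # requires scikit-learn >= 1.2 (older: base_estimator)
        n_estimators=n_estimators,    # 50 classifiers, as in the text
        max_features=feature_fraction,
        bootstrap=False,              # keep every training sequence...
        bootstrap_features=False,     # ...and draw feature subsets without replacement
    )

def sum_rule(ensembles, test_sets):
    """Fuse the descriptor-specific ensembles by summing their class scores."""
    total = None
    for clf, X in zip(ensembles, test_sets):
        scores = clf.decision_function(X)                            # (n_samples, n_classes)
        scores = (scores - scores.mean()) / (scores.std() + 1e-12)   # assumed normalization
        total = scores if total is None else total + scores
    return ensembles[0].classes_[np.argmax(total, axis=1)]

# Usage with hypothetical per-sequence descriptor matrices and action labels y:
# ensembles = [random_subspace_svm().fit(X, y) for X in (X_ac, X_lbp, X_hog)]
# predictions = sum_rule(ensembles, (X_ac_test, X_lbp_test, X_hog_test))
```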
3 Human action classification descriptors

In this section we describe the descriptors used in our system: Gabor filters, invariant local binary patterns, and the histogram of oriented gradients.

3.1 Gabor filters

This descriptor is based on the well-known FingerCode method [20, 21] developed for fingerprint matching; we name our adaptation for human action classification ActionCode (AC). The basic AC algorithm is as follows:

Step 1) Tessellate the region of interest around the center of the image;
Step 2) Filter the region of interest using a bank of Gabor filters;
Step 3) Compute the average absolute deviation from the mean of the gray values in each sector of the filtered images.

The region of interest (see Figure 3) is divided into 7 bands, and in each band 24 sectors are extracted (see [20] for details). The mask images are assumed to be aligned.

[Figure 3. Region of interest in an image mask.]

A symmetric Gabor filter has the following general form in the spatial domain [21]:

G(x, y; \nu, \sigma, \theta) = \exp\left( -\frac{x'^2 + y'^2}{2\sigma^2} \right) \cos(2\pi\nu x')

x' = x\sin\theta + y\cos\theta
y' = x\cos\theta - y\sin\theta

where ν is the frequency of the sinusoidal wave, θ is the orientation, and σ is the standard deviation of the Gaussian envelope. In our experiments, the filters are obtained considering 12 angles, equally spaced between 0° and 180°.
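A rough sketch of the AC extraction is given below, under several assumptions: the ROI is already cropped around the mask center, the bands are taken as radius-proportional rings (the exact FingerCode tessellation differs; see [20]), and the filter parameters nu, sigma, and the kernel size are illustrative values not given in the text. The kernel follows the equation above.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(nu, sigma, theta, size=33):
    """Symmetric Gabor filter G(x, y; nu, sigma, theta) from the equation above."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xp = x * np.sin(theta) + y * np.cos(theta)
    yp = x * np.cos(theta) - y * np.sin(theta)
    return np.exp(-(xp**2 + yp**2) / (2 * sigma**2)) * np.cos(2 * np.pi * nu * xp)

def action_code(roi, n_bands=7, n_sectors=24, nu=0.1, sigma=4.0):
    """AAD over 7 bands x 24 sectors for 12 equally spaced orientations."""
    h, w = roi.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2)                 # radius of every pixel
    phi = np.arctan2(yy - h / 2, xx - w / 2) % (2 * np.pi)
    band = np.minimum((r / (r.max() + 1e-9) * n_bands).astype(int), n_bands - 1)
    sector = (phi / (2 * np.pi) * n_sectors).astype(int)
    features = []
    for theta in np.linspace(0, np.pi, 12, endpoint=False):   # 12 angles in [0, 180)
        filtered = fftconvolve(roi, gabor_kernel(nu, sigma, theta), mode="same")
        for b in range(n_bands):
            for s in range(n_sectors):
                cell = filtered[(band == b) & (sector == s)]
                # Average absolute deviation from the mean of the sector (Step 3)
                features.append(np.abs(cell - cell.mean()).mean() if cell.size else 0.0)
    return np.asarray(features)
```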
3.2 Invariant local binary patterns

This operator [22] has several properties of interest: it has low computational complexity, is robust to illumination changes, and is rotation invariant. The local binary pattern (LBP) is a histogram based on a statistical operator that examines the joint distribution of the gray values of a circularly symmetric neighborhood of P pixels on a circle of radius R around a pixel x. In this study we use a multi-resolution descriptor obtained by concatenating histograms calculated with the parameters (P=8, R=1) and (P=16, R=2). Each mask image is divided into 5×6 equal non-overlapping regions, the histograms are calculated in each subregion, and these 5×6=30 histograms are concatenated. We use these subdivision values because they are the defaults offered in Poppe's Matlab code [7], which permits direct comparison.
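A minimal sketch of this block-wise multi-resolution LBP descriptor, using scikit-image: method="uniform" computes the rotation-invariant uniform patterns of [22], and the per-region histogram normalization is our assumption, not stated in the text.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_descriptor(mask, grid=(5, 6), params=((8, 1), (16, 2))):
    """Concatenate per-region LBP histograms at two (P, R) resolutions."""
    h, w = mask.shape
    feats = []
    for P, R in params:
        codes = local_binary_pattern(mask.astype(np.uint8), P, R, method="uniform")
        n_bins = P + 2  # rotation-invariant uniform patterns yield P + 2 codes
        for i in range(grid[0]):
            for j in range(grid[1]):
                region = codes[i * h // grid[0]:(i + 1) * h // grid[0],
                               j * w // grid[1]:(j + 1) * w // grid[1]]
                hist, _ = np.histogram(region, bins=n_bins, range=(0, n_bins))
                feats.append(hist / max(hist.sum(), 1))  # normalized histogram
    return np.concatenate(feats)
```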
3.3 Histogram of oriented gradients

The histogram of oriented gradients (HOG) was first proposed by Dalal and Triggs [23] as an image descriptor for localizing pedestrians. In this work we use weighted HOGs as implemented in [1], where each image is divided into 5×6 equal non-overlapping regions. In each subregion, the orientation and magnitude of the gradient at each pixel are calculated. The absolute orientations are discretized over 9 equally sized bins in the 0°-180° range, and in the resulting 9-bin histogram each pixel contributes to its orientation bin with a weight equal to its gradient magnitude.
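The weighted histogram computation can be sketched in a few lines of NumPy. This follows the textual description above rather than Poppe's Matlab implementation [1], and the per-region normalization is an assumption.

```python
import numpy as np

def hog_descriptor(mask, grid=(5, 6), n_bins=9):
    """Magnitude-weighted orientation histograms over a 5x6 grid of regions."""
    img = mask.astype(float)
    gy, gx = np.gradient(img)                       # simple central differences
    magnitude = np.hypot(gx, gy)
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned, in [0, 180)
    h, w = img.shape
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            sl = (slice(i * h // grid[0], (i + 1) * h // grid[0]),
                  slice(j * w // grid[1], (j + 1) * w // grid[1]))
            # Each pixel votes for its orientation bin, weighted by magnitude
            hist, _ = np.histogram(orientation[sl], bins=n_bins,
                                   range=(0.0, 180.0), weights=magnitude[sl])
            feats.append(hist / max(hist.sum(), 1e-12))
    return np.concatenate(feats)
```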
4 Experimental results

In Tables 1 and 2 we report our results and compare them with the state of the art. The method named Fusion is the sum-rule combination of the three systems tested. Our classification results using fusion (97.8% on the 10-class and 100% on the 9-class Weizmann dataset) match the best-performing systems reported to date on the same datasets [9-11, 16]. This demonstrates that simple feature extractors can obtain results comparable to the state of the art. Of course, these results are preliminary, since only the Weizmann dataset has been tried.

Table 1. Results on the 10-class Weizmann dataset.

  Method                      Accuracy
  AC (stand-alone)            90.3%
  LBP (stand-alone)           91.4%
  HOG (stand-alone)           94.6%
  Fusion                      97.8%
  LBP-TOP [16]                95.6%
  Kellokumpu et al. [9]       97.8%

Table 2. Results on the 9-class Weizmann dataset.

  Method                      Accuracy
  AC (stand-alone)            95.2%
  LBP (stand-alone)           92.8%
  HOG (stand-alone)           96.6%
  Fusion                      100.0%
  LBP-TOP [16]                98.7%
  Boiman and Irani [10]       97.8%
  Ikizler and Duygulu [11]    100.0%

These experimental results confirm what is well known in the machine learning community, namely, that combining methods is a simple way to improve performance (see, e.g., [24] in biometrics and [25] in bioinformatics).

Finally, in Table 3 we report the computational time for extracting each descriptor. These times were obtained using MATLAB 7.5 on a 2 GHz dual-core machine and refer to a single frame. They show that both LBP and HOG can be extracted in real time. In our opinion, real-time computation of AC is also possible using a GPU (a full GPU engine for MATLAB built on NVIDIA's CUDA technology, named Jacket, is available at http://www.accelereyes.com/).

Table 3. Computational time per frame.

  Descriptor    Time
  AC            0.33 s
  LBP           0.05 s
  HOG           0.005 s

5 Conclusions

This paper focused on descriptors for training an ensemble of machine learning algorithms for human action classification. We propose combining three relatively simple feature extractors to obtain a system that performs as well as more complex systems. The proposed ensemble has been tested on the Weizmann dataset, one of the most widely used benchmarks for comparing human action classification approaches. Our fusion results of 97.8% accuracy on the 10-class Weizmann dataset and 100% accuracy on the 9-class Weizmann dataset match the best performance reported thus far.

This study makes a number of contributions. It is the first study to use fusion for human action classification. In addition, we introduce two new descriptors for this task: one based on the response of Gabor filters and the other based on the standard application of local binary patterns to the mask images. Although classification using each of these descriptors alone does not compare as well with the state of the art, they can be combined with other systems to further improve performance. Finally, our fusion of simple descriptors performs as well as more sophisticated systems while being computationally less intensive. It is very likely that our method could operate in a real-time setting.

In future studies we plan to test other fusion methods, in particular weighted approaches in which each method is assigned a different weight, so that the best-performing approaches are given more influence in the combination step. We also want to test our methods on some of the other datasets mentioned in the introduction.

6 Acknowledgements

This work was funded in part by Missouri State University Futures Grant: AI-ARTISTIC PROC 6-2011. The authors would like to thank Vonda Yarberry and Ruth Barnes, members of the AI-ARTISTIC PROC project, for their support. We would also like to thank Ronald Poppe at the Human Media Interaction Group, Department of Computer Science, University of Twente, for sharing the Matlab code for the histogram of gradients. The Matlab code for the LBP is available at http://www.ee.oulu.fi/mvg/page/lbp_matlab.

7 References

[1] R. Poppe, "Evaluating example-based pose estimation: Experiments on the HumanEva sets," in CVPR 2nd Workshop on Evaluation of Articulated Human Motion and Pose Estimation, 2007.
[2] J. K. Aggarwal and Q. Cai, "Human motion analysis: A review," Computer Vision and Image Understanding, vol. 73, no. 3, pp. 428-440, 1999.
[3] A. F. Bobick, "Movement, activity and action: The role of knowledge in the perception of motion," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 352, no. 1358, pp. 1257-1265, 1997.
[4] D. M. Gavrila, "The visual analysis of human movement: A survey," Computer Vision and Image Understanding, vol. 73, no. 1, pp. 82-92, 1999.
[5] V. Krüger, D. Kragic, A. Ude et al., "The meaning of action: A review of action recognition and mapping," Advanced Robotics, vol. 21, no. 13, pp. 1473-1501, 2007.
[6] A. Oikonomopoulos, M. Pantic, and I. Patras, "B-spline polynomial descriptors for human activity recognition," in Workshop on Computer Vision and Pattern Recognition for Human Communicative Behavior Analysis, Anchorage, AK, 2008, pp. 1-8.
[7] R. Poppe, "Discriminative vision-based recovery and recognition of human motion," Human Media Interaction, University of Twente, The Netherlands, 2009.
[8] M. Blank, L. Gorelick, E. Shechtman et al., "Actions as space-time shapes," in International Conference on Computer Vision, Beijing, China, 2005, pp. 1395-1402.
[9] V. Kellokumpu, G. Zhao, and M. Pietikäinen, "Texture based description of movements for activity analysis," in VISAPP, 2008, pp. 206-213.
[10] O. Boiman and M. Irani, "Similarity by composition," in Neural Information Processing Systems (NIPS), 2006.
[11] N. Ikizler and P. Duygulu, "Human action recognition using distribution of oriented rectangular patches," in ICCV Workshop on Human Motion Understanding, Modeling, Capture and Animation, 2007.
[12] C. Schüldt, I. Laptev, and B. Caputo, "Recognizing human actions: A local SVM approach," in International Conference on Pattern Recognition, Cambridge, United Kingdom, 2004, pp. 32-36.
[13] D. Weinland, R. Ronfard, and E. Boyer, "Free viewpoint action recognition using motion history volumes," Computer Vision and Image Understanding, vol. 104, no. 2-3, pp. 249-257, 2006.
[14] M. D. Rodriguez, J. Ahmed, and M. Shah, "Action MACH: A spatio-temporal maximum average correlation height filter for action recognition," in Conference on Computer Vision and Pattern Recognition, Anchorage, AK, 2008, pp. 1-8.
[15] I. Laptev, M. Marszalek, C. Schmid et al., "Learning realistic human actions from movies," in Computer Vision and Pattern Recognition, Anchorage, AK, 2008.
[16] V. Kellokumpu, G. Zhao, and M. Pietikäinen, "Human activity recognition using a dynamic texture based method," in BMVC, 2008.
[17] L. Gorelick, M. Blank, E. Shechtman et al., "Actions as space-time shapes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 12, pp. 2247-2253, 2007.
[18] N. Howe and A. Deschamps, "Better foreground segmentation through graph cuts," arXiv.org Tech Report arXiv:cs/0401017v2, 2004.
[19] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., New York: Wiley, 2000.
[20] A. K. Jain, S. Prabhakar, L. Hong et al., "Filterbank-based fingerprint matching," IEEE Transactions on Image Processing, vol. 9, no. 5, pp. 846-859, 2000.
[21] L. Nanni and A. Lumini, "Two-class fingerprint matcher," Pattern Recognition, vol. 39, no. 4, pp. 714-716, 2006.
[22] T. Ojala, M. Pietikäinen, and T. Mäenpää, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971-987, 2002.
[23] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, 2005.
[24] D. Maio and L. Nanni, "Multihashing, human authentication featuring biometrics data and tokenised random number: A case study," Neurocomputing, vol. 69, pp. 242-249, 2005.
[25] L. Nanni and A. Lumini, "Ensemblator: An ensemble of classifiers for reliable classification of biological data," Pattern Recognition Letters, vol. 28, no. 5, pp. 622-630, 2007.