Abstract
This paper proposes a framework for RGB-D-based action recognition that takes advantage of hand-designed features from skeleton data and deeply learned features from depth maps, and effectively exploits both local and global temporal information. Specifically, depth and skeleton data are first augmented for deep learning and to make recognition insensitive to view variation. Second, depth sequences are segmented using handcrafted features based on a skeleton-joint motion histogram, which exploits the local temporal information. All training segments are clustered with an Infinite Gaussian Mixture Model (IGMM) through Bayesian estimation and labelled for training Convolutional Neural Networks (ConvNets) on the depth maps. A depth sequence can thus be reliably encoded into a sequence of segment labels. Finally, the label sequence is fed into a joint Hidden Markov Model and Support Vector Machine (HMM-SVM) classifier to exploit the global temporal information for final recognition. The proposed framework was evaluated on the widely used MSRAction-Pairs, MSRDailyActivity3D and UTD-MHAD datasets and achieved promising results.
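The segment-clustering step described above can be illustrated with a minimal sketch. A Dirichlet-process Gaussian mixture (scikit-learn's `BayesianGaussianMixture` with a Dirichlet-process prior) is a practical stand-in for the paper's IGMM: the truncation level `n_components` is only an upper bound, and unused components receive near-zero weight, so the effective number of clusters is inferred from the data. The toy "segment features" below are placeholders, not the paper's actual motion-histogram descriptors.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Toy "segment features": two well-separated groups of 8-D vectors
# standing in for per-segment motion-histogram descriptors.
segments = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 8)),
    rng.normal(1.0, 0.1, size=(50, 8)),
])

# Dirichlet-process mixture as an approximation of the IGMM:
# n_components is a truncation level, not the final cluster count.
dpgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    random_state=0,
).fit(segments)

# One symbol per segment; a depth sequence then becomes a sequence
# of such labels, ready for a temporal classifier such as HMM-SVM.
labels = dpgmm.predict(segments)
```

In the full pipeline these labels would be produced per depth-sequence segment and the resulting label sequences passed to the HMM-SVM stage; that stage is omitted here since it depends on the paper's specific segmentation.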
Acknowledgements
This work was funded by the National Natural Science Foundation of China (Nos. 61571325 and 61502357) and the Fundamental Research Funds for the Central Universities, China University of Geosciences (Wuhan) (No. CUG170654).
Cite this article
Wang, S., Hou, Y., Li, Z. et al. Combining ConvNets with hand-crafted features for action recognition based on an HMM-SVM classifier. Multimed Tools Appl 77, 18983–18998 (2018). https://doi.org/10.1007/s11042-017-5335-0