
Video spatiotemporal mapping for human action recognition by convolutional neural network

Theoretical advances · Published in Pattern Analysis and Applications

Abstract

In this paper, a 2D representation of a video clip called the video spatiotemporal map (VSTM) is presented. The VSTM is a compact representation of a video clip that incorporates both its spatial and temporal properties. It is created by vertically concatenating feature vectors generated from subsequent frames. The feature vector for each frame is obtained by applying a wavelet transform to that frame (or to its difference from the subsequent frame) and computing the vertical and horizontal projections of the quantized coefficients of specific wavelet subbands. The VSTM enables convolutional neural networks (CNNs) to process a video clip for human action recognition (HAR). The proposed approach benefits from the power of CNNs to analyze visual patterns and attempts to overcome CNN challenges such as the variable video length problem and the lack of training data, which leads to over-fitting. The VSTM presents a sequence of frames to a CNN without imposing any additional computational cost on the CNN learning algorithm. Experimental results on the KTH, Weizmann, and UCF Sports HAR benchmark datasets show the superiority of the proposed method over state-of-the-art methods that use CNNs to solve the HAR problem.
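To make the construction described above concrete, the following is a minimal Python sketch of a VSTM builder using NumPy and PyWavelets, assuming grayscale frames of a common size. The wavelet choice ('haar'), the subband subset (cH and cV), the 16-bin quantizer, and all function names are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
import pywt  # PyWavelets


def frame_feature(img, n_bins=16):
    """Feature vector for one grayscale image: one-level 2D DWT, quantize
    selected subbands, then take vertical and horizontal projections.
    The subband subset and quantizer are assumptions for illustration."""
    cA, (cH, cV, cD) = pywt.dwt2(img, 'haar')
    parts = []
    for band in (cH, cV):  # assumed subset of the "specific wavelet subbands"
        edges = np.linspace(band.min(), band.max() + 1e-6, n_bins)
        q = np.digitize(band, edges)       # quantized coefficients
        parts.append(q.sum(axis=0))        # vertical projection (down columns)
        parts.append(q.sum(axis=1))        # horizontal projection (across rows)
    return np.concatenate(parts)


def build_vstm(frames, use_diff=True):
    """Vertically concatenate per-frame feature vectors into a 2D map.
    If use_diff is set, each row is computed from the subtraction of a
    frame from its subsequent frame, as the abstract allows."""
    rows = []
    for i, f in enumerate(frames):
        img = f.astype(np.float32)
        if use_diff:
            if i + 1 == len(frames):
                break                      # last frame has no successor
            img = frames[i + 1].astype(np.float32) - img
        rows.append(frame_feature(img))
    return np.vstack(rows)                 # shape: (#rows, feature_dim)
```

Because the resulting map is an ordinary 2D array with one row per frame (or frame difference), it can be fed to a standard image CNN like any grayscale image; clips of different lengths could then be normalized by, for example, resizing the map to a fixed height, which is one way to read the abstract's claim about the variable-length problem.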




Author information

Corresponding author

Correspondence to Hamid Abrishami Moghaddam.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zare, A., Abrishami Moghaddam, H. & Sharifi, A. Video spatiotemporal mapping for human action recognition by convolutional neural network. Pattern Anal Applic 23, 265–279 (2020). https://doi.org/10.1007/s10044-019-00788-1

