
Video spatiotemporal mapping for human action recognition by convolutional neural network

Theoretical advances · Published in Pattern Analysis and Applications

Abstract

In this paper, a 2D representation of a video clip called the video spatiotemporal map (VSTM) is presented. The VSTM is a compact representation of a video clip that incorporates both its spatial and temporal properties. It is created by vertically concatenating feature vectors generated from subsequent frames. The feature vector for each frame is obtained by applying a wavelet transform to that frame (or to its difference from the subsequent frame) and computing the vertical and horizontal projections of the quantized coefficients of specific wavelet subbands. The VSTM enables convolutional neural networks (CNNs) to process a video clip for human action recognition (HAR). The proposed approach benefits from the power of CNNs to analyze visual patterns and attempts to overcome CNN challenges such as the variable video length problem and the lack of training data, which leads to over-fitting. The VSTM presents a sequence of frames to a CNN without imposing any additional computational cost on the CNN learning algorithm. Experimental results on the KTH, Weizmann, and UCF Sports HAR benchmark datasets show the superiority of the proposed method over state-of-the-art methods that use CNNs to solve the HAR problem.
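To make the construction described above concrete, the following is a minimal Python sketch of a VSTM builder using NumPy and PyWavelets, assuming grayscale frames of a common size. The wavelet choice ('haar'), the subband subset (cH and cV), the 16-bin quantizer, and all function names are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
import pywt  # PyWavelets


def frame_feature(img, n_bins=16):
    """Feature vector for one grayscale image: one-level 2D DWT, quantize
    selected subbands, then take vertical and horizontal projections.
    The subband subset and quantizer are assumptions for illustration."""
    cA, (cH, cV, cD) = pywt.dwt2(img, 'haar')
    parts = []
    for band in (cH, cV):  # assumed subset of the "specific wavelet subbands"
        edges = np.linspace(band.min(), band.max() + 1e-6, n_bins)
        q = np.digitize(band, edges)       # quantized coefficients
        parts.append(q.sum(axis=0))        # vertical projection (down columns)
        parts.append(q.sum(axis=1))        # horizontal projection (across rows)
    return np.concatenate(parts)


def build_vstm(frames, use_diff=True):
    """Vertically concatenate per-frame feature vectors into a 2D map.
    If use_diff is set, each row is computed from the subtraction of a
    frame from its subsequent frame, as the abstract allows."""
    rows = []
    for i, f in enumerate(frames):
        img = f.astype(np.float32)
        if use_diff:
            if i + 1 == len(frames):
                break                      # last frame has no successor
            img = frames[i + 1].astype(np.float32) - img
        rows.append(frame_feature(img))
    return np.vstack(rows)                 # shape: (#rows, feature_dim)
```

Because the resulting map is an ordinary 2D array with one row per frame (or frame difference), it can be fed to a standard image CNN like any grayscale image; clips of different lengths could then be normalized by, for example, resizing the map to a fixed height, which is one way to read the abstract's claim about the variable-length problem.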




Author information

Corresponding author

Correspondence to Hamid Abrishami Moghaddam.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zare, A., Abrishami Moghaddam, H. & Sharifi, A. Video spatiotemporal mapping for human action recognition by convolutional neural network. Pattern Anal Applic 23, 265–279 (2020). https://doi.org/10.1007/s10044-019-00788-1

