Hybrid and hierarchical fusion networks: a deep cross-modal learning architecture for action recognition

Khowaja, Sunder Ali; Lee, Seok-Lyong

doi:10.1007/s00521-019-04578-y

Original Article
Published: 28 October 2019

Volume 32, pages 10423–10434 (2020)

Neural Computing and Applications


Abstract

Two-stream networks have provided an alternative way of exploiting spatiotemporal information for the action recognition problem. Nevertheless, most two-stream variants fuse homogeneous modalities, which cannot efficiently capture the action-motion dynamics in videos. Moreover, existing studies cannot extend the number of streams beyond the number of modalities. To address these limitations, we propose hybrid and hierarchical fusion (HHF) networks. The hybrid fusion handles non-homogeneous modalities and introduces a cross-modal learning stream for effective modeling of motion dynamics, while extending the networks from existing two-stream variants to three and six streams. The hierarchical fusion, in turn, makes the modalities consistent by modeling long-term temporal information and combines multiple streams to improve recognition performance. The proposed network architecture comprises three fusion tiers: the hybrid fusion itself, the long-term fusion pooling layer, which models the long-term dynamics from the RGB and optical flow modalities, and the adaptive weighting scheme for combining the classification scores from several streams. We show that the hybrid fusion yields representations different from the base modalities for training the cross-modal learning stream. We have conducted extensive experiments and shown that the proposed six-stream HHF network outperforms the existing two- and four-stream networks, achieving state-of-the-art recognition accuracies of 97.2% and 76.7% on the UCF101 and HMDB51 datasets, respectively, which are widely used in action recognition studies.
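
As an illustration only of the third fusion tier, score-level fusion across streams can be sketched as a weighted average of per-stream class probabilities. The abstract does not say how the adaptive weights are computed, so everything below is an assumption rather than the authors' scheme: the weights (taken here from hypothetical validation accuracies), the example logits, and the function names `softmax`, `normalize_weights`, and `fuse_scores`.

```python
import math


def softmax(logits):
    """Convert raw class scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalize_weights(raw_weights):
    """Scale non-negative stream weights so that they sum to 1."""
    total = sum(raw_weights)
    if total <= 0:
        raise ValueError("at least one stream weight must be positive")
    return [w / total for w in raw_weights]


def fuse_scores(stream_logits, raw_weights):
    """Weighted average of per-stream class probabilities."""
    if len(stream_logits) != len(raw_weights):
        raise ValueError("need exactly one weight per stream")
    weights = normalize_weights(raw_weights)
    probs = [softmax(logits) for logits in stream_logits]
    num_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs))
        for c in range(num_classes)
    ]


# Hypothetical logits for three streams over four action classes.
stream_logits = [
    [2.0, 0.5, 0.1, 0.0],  # e.g. an RGB stream
    [0.2, 1.5, 0.3, 0.1],  # e.g. an optical-flow stream
    [1.8, 0.4, 0.2, 0.1],  # e.g. a cross-modal stream
]
# Assumed weights: each stream's accuracy on a held-out validation split.
raw_weights = [0.85, 0.80, 0.90]

fused = fuse_scores(stream_logits, raw_weights)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Because the weights are normalized and each stream contributes a probability distribution, the fused scores also sum to 1, so they can be compared directly across videos. A six-stream variant would pass six logit vectors and six weights to the same function.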

Acknowledgements

This research was supported by the Hankuk University of Foreign Studies Research Fund (Grant No. 2019-1) and by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (2018R1D1A1B07049113).

Author information

Authors and Affiliations

Department of Industrial and Management Engineering, Hankuk University of Foreign Studies, Global Campus, Yongin-si, South Korea
Sunder Ali Khowaja & Seok-Lyong Lee

Corresponding author

Correspondence to Seok-Lyong Lee.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Khowaja, S.A., Lee, SL. Hybrid and hierarchical fusion networks: a deep cross-modal learning architecture for action recognition. Neural Comput & Applic 32, 10423–10434 (2020). https://doi.org/10.1007/s00521-019-04578-y

Received: 20 May 2019
Accepted: 17 October 2019
Published: 28 October 2019
Issue Date: July 2020
DOI: https://doi.org/10.1007/s00521-019-04578-y
