Action Recognition Using Visual Attention with Reinforcement Learning

Li, Hongyang; Chen, Jun; Hu, Ruimin; Yu, Mei; Chen, Huafeng; Xu, Zengmin

doi:10.1007/978-3-030-05716-9_30

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11296)

Included in the following conference series:

International Conference on Multimedia Modeling

Abstract

Human action recognition in videos is a challenging and significant task with a broad range of applications. The advantage of the visual attention mechanism is that it can effectively reduce noise interference by focusing on the relevant parts of the image and ignoring the irrelevant parts. We propose a deep visual attention model with reinforcement learning for this task. We use a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) units as the learning agent. The agent interacts with the video and decides both which frame to look at next and where the most relevant region of the selected frame is located. The REINFORCE method is used to learn the agent’s decision policy, and back-propagation is used to train the action classifier. The experimental results show that the glimpse window can focus on important cues. Our model achieves significant performance improvements on the action recognition datasets UCF101 and HMDB51.
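To make the REINFORCE step mentioned in the abstract more concrete, the following is a minimal toy sketch and not the authors' model. It assumes a 1-D Gaussian policy over a single glimpse location, a hypothetical dense reward (negative squared distance to a made-up target location), and a running-average baseline; there is no LSTM, no video input, and no classifier. The function name and all parameter values are illustrative assumptions.

```python
import random


def reinforce_glimpse_demo(target=0.6, sigma=0.15, lr=0.01, steps=3000, seed=0):
    """Toy REINFORCE: learn the mean of a 1-D Gaussian glimpse-location policy.

    Everything here (target, reward, hyperparameters) is hypothetical and only
    illustrates the score-function gradient estimator.
    """
    rng = random.Random(seed)
    mu = 0.0        # policy parameter: mean glimpse location
    baseline = 0.0  # running average of rewards, used to reduce variance
    for _ in range(steps):
        # Sample an action (a glimpse location) from the current policy.
        loc = rng.gauss(mu, sigma)
        # Hypothetical reward: higher when the glimpse lands near the target.
        reward = -(loc - target) ** 2
        # Score function: d/dmu log N(loc; mu, sigma^2) = (loc - mu) / sigma^2
        grad_log_prob = (loc - mu) / sigma ** 2
        # REINFORCE update: ascend (reward - baseline) * grad log pi(loc).
        mu += lr * (reward - baseline) * grad_log_prob
        # Track the average reward with an exponential moving average.
        baseline += 0.05 * (reward - baseline)
    return mu


if __name__ == "__main__":
    print("learned mean glimpse location:", reinforce_glimpse_demo())
```

In this sketch the policy mean drifts toward the hypothetical target because the expected update equals the gradient of the expected reward. In a full model of the kind the abstract describes, the reward would presumably depend on the classifier's output, and the policy parameters would be those of the recurrent agent.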

Acknowledgement

This research was supported by the National Natural Science Foundation of China (61671336, U1611461, U1736206), the Technology Research Program of the Ministry of Public Security (2016JSYJA12), the Hubei Province Technological Innovation Major Project (2016AAA015, 2017AAA123), the Hubei Provincial Education Department Project (16Q070), and the Natural Science Foundation of Jiangsu Province (BK20160386).

Author information

Authors and Affiliations

National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, China
Hongyang Li, Jun Chen, Ruimin Hu & Zengmin Xu
Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan, China
Jun Chen & Ruimin Hu
College of Computer and Information Technology, China Three Gorges University, Yichang, China
Hongyang Li & Mei Yu
Jingchu University of Technology, Jingmen, China
Huafeng Chen

Corresponding author

Correspondence to Hongyang Li.

Editor information

Editors and Affiliations

Information Technologies Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece
Ioannis Kompatsiaris
EURECOM, Sophia Antipolis, France
Benoit Huet
Information Technologies Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece
Vasileios Mezaris
Dublin City University, Dublin, Ireland
Cathal Gurrin
National Chiao Tung University, Hsinchu, Taiwan
Wen-Huang Cheng
Information Technologies Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece
Stefanos Vrochidis

About this paper

Cite this paper

Li, H., Chen, J., Hu, R., Yu, M., Chen, H., Xu, Z. (2019). Action Recognition Using Visual Attention with Reinforcement Learning. In: Kompatsiaris, I., Huet, B., Mezaris, V., Gurrin, C., Cheng, WH., Vrochidis, S. (eds) MultiMedia Modeling. MMM 2019. Lecture Notes in Computer Science, vol 11296. Springer, Cham. https://doi.org/10.1007/978-3-030-05716-9_30

DOI: https://doi.org/10.1007/978-3-030-05716-9_30
Published: 11 December 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-05715-2
Online ISBN: 978-3-030-05716-9
eBook Packages: Computer Science, Computer Science (R0)
