Attention-Based Bidirectional Recurrent Neural Networks for Description Generation of Videos

  • Conference paper
Cloud Computing and Security (ICCCS 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11068)

  • 1527 Accesses

Abstract

Describing videos in natural language is vitally important in many applications, such as managing massive collections of online video and providing descriptive video service (DVS) for blind people. To improve on existing video description frameworks, this paper presents an end-to-end deep learning model that combines Convolutional Neural Networks (CNNs) and Bidirectional Recurrent Neural Networks (BiRNNs) through a multimodal attention mechanism. First, the model produces richer video representations than comparable approaches by extracting image, motion, and audio features. Second, a BiRNN encoder processes these features in both the forward and backward directions. Finally, an attention-based decoder translates the encoder's sequential outputs into a sequence of words. The model is evaluated on the Microsoft Research Video Description Corpus (MSVD). The results demonstrate the benefit of combining BiRNNs with a multimodal attention mechanism and show that the model outperforms other state-of-the-art methods evaluated on this dataset.
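
To make the pipeline in the abstract concrete, here is a minimal PyTorch sketch of a model with the same shape: per-frame multimodal features are fused, encoded by a bidirectional GRU, and decoded one word at a time with additive temporal attention. The layer sizes, the GRU choice, the fusion-by-concatenation step, and all names below are illustrative assumptions rather than the authors' exact configuration; in particular, the paper's multimodal attention may weight modalities separately, whereas this sketch attends over time only.

```python
# Illustrative sketch only (assumed dimensions and module names), not the
# authors' implementation: CNN features -> BiRNN encoder -> attention decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiRNNEncoder(nn.Module):
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim,
                          batch_first=True, bidirectional=True)

    def forward(self, feats):                 # feats: (B, T, feat_dim)
        outputs, _ = self.rnn(feats)          # outputs: (B, T, 2*hidden_dim)
        return outputs

class AttentionDecoder(nn.Module):
    def __init__(self, vocab_size, enc_dim, hidden_dim, emb_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.score = nn.Linear(enc_dim + hidden_dim, 1)  # additive attention
        self.cell = nn.GRUCell(emb_dim + enc_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word, h, enc_outs):
        # Weight every encoder timestep by its relevance to the decoder state.
        T = enc_outs.size(1)
        h_rep = h.unsqueeze(1).expand(-1, T, -1)          # (B, T, hidden)
        alpha = F.softmax(
            self.score(torch.cat([enc_outs, h_rep], -1)).squeeze(-1), dim=-1)
        context = torch.bmm(alpha.unsqueeze(1), enc_outs).squeeze(1)
        h = self.cell(torch.cat([self.embed(word), context], -1), h)
        return self.out(h), h                             # logits, new state

# Fuse image, motion, and audio features by concatenation (one plausible
# multimodal scheme); the feature widths below are assumptions.
B, T, V = 2, 20, 5000
img, mot, aud = (torch.randn(B, T, d) for d in (2048, 4096, 128))
feats = torch.cat([img, mot, aud], dim=-1)

enc = BiRNNEncoder(feats.size(-1), 512)
dec = AttentionDecoder(V, enc_dim=1024, hidden_dim=512)

enc_outs = enc(feats)
h = torch.zeros(B, 512)                      # initial decoder state
word = torch.zeros(B, dtype=torch.long)      # assumed <bos> token id 0
logits, h = dec.step(word, h, enc_outs)      # (B, V): next-word distribution
```

At training time one would run `step` with teacher forcing over the reference caption and minimize per-word cross-entropy; at test time the argmax (or a beam search) over `logits` feeds the next step.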

Acknowledgments

This work was supported by the Science and Technology Support Program of Jiangsu Province under the projects "Research and Industrialization for Intelligent Video Processing Technology Based on GPU Parallel Computing" (BY2016003-11) and "Application Platform and Industrialization for Efficient Cloud Computing for Big Data" (BA2015052).

Author information

Corresponding author

Correspondence to Xiaotong Du.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Du, X., Yuan, J., Liu, H. (2018). Attention-Based Bidirectional Recurrent Neural Networks for Description Generation of Videos. In: Sun, X., Pan, Z., Bertino, E. (eds) Cloud Computing and Security. ICCCS 2018. Lecture Notes in Computer Science, vol 11068. Springer, Cham. https://doi.org/10.1007/978-3-030-00021-9_40

  • DOI: https://doi.org/10.1007/978-3-030-00021-9_40

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00020-2

  • Online ISBN: 978-3-030-00021-9

  • eBook Packages: Computer Science, Computer Science (R0)
