Attention-Based Bidirectional Recurrent Neural Networks for Description Generation of Videos

  • Conference paper
Cloud Computing and Security (ICCCS 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11068)

  • 1527 Accesses

Abstract

Describing videos in natural language is vitally important in many applications, such as managing massive collections of online video and providing descriptive video service (DVS) for blind people. To improve on existing video description frameworks, this paper presents an end-to-end deep learning model that combines Convolutional Neural Networks (CNNs) and Bidirectional Recurrent Neural Networks (BiRNNs) through a multimodal attention mechanism. First, the model produces richer video representations than comparable approaches by extracting image, motion, and audio features. Second, a BiRNN encoder processes these features in both the forward and backward directions. Finally, an attention-based decoder translates the encoder's sequential outputs into a sequence of words. The model is evaluated on the Microsoft Research Video Description Corpus (MSVD). The results demonstrate the benefit of combining BiRNNs with a multimodal attention mechanism and show that the model outperforms other state-of-the-art methods evaluated on this dataset.
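
To make the pipeline in the abstract concrete, here is a minimal PyTorch sketch of a model with the same shape: per-frame multimodal features are fused, encoded by a bidirectional GRU, and decoded one word at a time with additive temporal attention. The layer sizes, the GRU choice, the fusion-by-concatenation step, and all names below are illustrative assumptions rather than the authors' exact configuration; in particular, the paper's multimodal attention may weight modalities separately, whereas this sketch attends over time only.

```python
# Illustrative sketch only (assumed dimensions and module names), not the
# authors' implementation: CNN features -> BiRNN encoder -> attention decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiRNNEncoder(nn.Module):
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim,
                          batch_first=True, bidirectional=True)

    def forward(self, feats):                 # feats: (B, T, feat_dim)
        outputs, _ = self.rnn(feats)          # outputs: (B, T, 2*hidden_dim)
        return outputs

class AttentionDecoder(nn.Module):
    def __init__(self, vocab_size, enc_dim, hidden_dim, emb_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.score = nn.Linear(enc_dim + hidden_dim, 1)  # additive attention
        self.cell = nn.GRUCell(emb_dim + enc_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word, h, enc_outs):
        # Weight every encoder timestep by its relevance to the decoder state.
        T = enc_outs.size(1)
        h_rep = h.unsqueeze(1).expand(-1, T, -1)          # (B, T, hidden)
        alpha = F.softmax(
            self.score(torch.cat([enc_outs, h_rep], -1)).squeeze(-1), dim=-1)
        context = torch.bmm(alpha.unsqueeze(1), enc_outs).squeeze(1)
        h = self.cell(torch.cat([self.embed(word), context], -1), h)
        return self.out(h), h                             # logits, new state

# Fuse image, motion, and audio features by concatenation (one plausible
# multimodal scheme); the feature widths below are assumptions.
B, T, V = 2, 20, 5000
img, mot, aud = (torch.randn(B, T, d) for d in (2048, 4096, 128))
feats = torch.cat([img, mot, aud], dim=-1)

enc = BiRNNEncoder(feats.size(-1), 512)
dec = AttentionDecoder(V, enc_dim=1024, hidden_dim=512)

enc_outs = enc(feats)
h = torch.zeros(B, 512)                      # initial decoder state
word = torch.zeros(B, dtype=torch.long)      # assumed <bos> token id 0
logits, h = dec.step(word, h, enc_outs)      # (B, V): next-word distribution
```

At training time one would run `step` with teacher forcing over the reference caption and minimize per-word cross-entropy; at test time the argmax (or a beam search) over `logits` feeds the next step.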

Acknowledgments

This work was supported by the Science and Technology Support Program of Jiangsu Province under the projects "Research and Industrialization for Intelligent Video Processing Technology Based on GPU Parallel Computing" (BY2016003-11) and "Application Platform and Industrialization for Efficient Cloud Computing for Big Data" (BA2015052).

Author information

Corresponding author

Correspondence to Xiaotong Du.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Du, X., Yuan, J., Liu, H. (2018). Attention-Based Bidirectional Recurrent Neural Networks for Description Generation of Videos. In: Sun, X., Pan, Z., Bertino, E. (eds) Cloud Computing and Security. ICCCS 2018. Lecture Notes in Computer Science, vol 11068. Springer, Cham. https://doi.org/10.1007/978-3-030-00021-9_40

  • DOI: https://doi.org/10.1007/978-3-030-00021-9_40

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00020-2

  • Online ISBN: 978-3-030-00021-9

  • eBook Packages: Computer Science, Computer Science (R0)
