
Bidirectional Multimodal Recurrent Neural Networks with Refined Visual Features for Image Captioning

  • Conference paper
Internet Multimedia Computing and Service (ICIMCS 2017)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 819)

Abstract

Image captioning, which aims to automatically describe the content of an image in natural-language sentences, has become an attractive task at the intersection of computer vision and natural language processing. Recently, neural network approaches have been proposed and have proved to be among the most effective methods for image captioning. However, most prior work considers only past semantic context when generating the words of a sentence, ignoring future textual context. In this paper, we therefore propose a bidirectional multimodal Recurrent Neural Network (m-RNN) model that captures both past and future semantic context through a bidirectional recurrent layer. We first employ a pre-trained Convolutional Neural Network (CNN) to extract image features and then use the bidirectional m-RNN to generate a sentence describing each input image. In addition, we refine the visual features by combining word embedding features with the raw image features, further improving performance. Experimental results on the MS-COCO dataset demonstrate the superiority of our proposed model over the original m-RNN model.
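
The abstract describes the architecture only at a high level. As a rough illustration, the following is a minimal PyTorch sketch of the pipeline it outlines: a pre-trained CNN supplies an image feature vector, word embeddings are "refined" by fusing them with that image feature, and a bidirectional recurrent layer plus a multimodal layer predict each word from both past and future context. The GRU cell, the 256/2048 dimensions, the tanh activations, and fusion by concatenation are all illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class BidirectionalMRNNSketch(nn.Module):
    """Hypothetical sketch of a bidirectional m-RNN-style captioner."""

    def __init__(self, vocab_size, embed_dim=256, img_dim=2048, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # "Refined" visual features: fuse each word embedding with the raw
        # image feature before the recurrent layer (fusion choice assumed).
        self.refine = nn.Linear(embed_dim + img_dim, embed_dim)
        # Bidirectional recurrent layer: models past and future context.
        self.rnn = nn.GRU(embed_dim, hidden_dim,
                          batch_first=True, bidirectional=True)
        # Multimodal layer in the spirit of m-RNN: combine both RNN
        # directions with the image feature before the word classifier.
        self.multimodal = nn.Linear(2 * hidden_dim + img_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, captions):
        # img_feat: (B, img_dim) from a pre-trained CNN, e.g. Inception-v3.
        # captions: (B, T) integer token ids.
        B, T = captions.shape
        words = self.embed(captions)                      # (B, T, embed_dim)
        img = img_feat.unsqueeze(1).expand(B, T, -1)      # repeat over time
        refined = torch.tanh(self.refine(torch.cat([words, img], dim=-1)))
        context, _ = self.rnn(refined)                    # (B, T, 2*hidden)
        fused = torch.tanh(self.multimodal(torch.cat([context, img], dim=-1)))
        return self.classifier(fused)                     # (B, T, vocab_size)
```

Because the recurrent layer is bidirectional, the model scores a complete sentence in one pass rather than generating words strictly left to right; the abstract does not specify the decoding procedure, so this sketch covers training-time scoring only.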

Acknowledgement

This work was partially supported by the National Natural Science Foundation of China (Grant Nos. 61522203, 61572252, and 61672285) and by the Natural Science Foundation of Jiangsu Province (Grant Nos. BK20140058 and BK20150755).

Author information

Corresponding author

Correspondence to Liyan Zhang.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Shu, Y., Zhang, L., Li, Z., Tang, J. (2018). Bidirectional Multimodal Recurrent Neural Networks with Refined Visual Features for Image Captioning. In: Huet, B., Nie, L., Hong, R. (eds) Internet Multimedia Computing and Service. ICIMCS 2017. Communications in Computer and Information Science, vol 819. Springer, Singapore. https://doi.org/10.1007/978-981-10-8530-7_8

  • DOI: https://doi.org/10.1007/978-981-10-8530-7_8

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-8529-1

  • Online ISBN: 978-981-10-8530-7

  • eBook Packages: Computer Science, Computer Science (R0)
