
Learning Cooperative Neural Modules for Stylized Image Captioning

International Journal of Computer Vision

Abstract

Recent progress in stylized image captioning has been achieved through the encoder-decoder framework, which generates a sentence in a one-pass decoding process. However, it remains difficult for such a decoding process to simultaneously capture the syntactic structure, infer the semantic concepts, and express the linguistic style. Research in psycholinguistics has revealed that human language production involves multiple stages, starting with several rough concepts and ending with fluent sentences. With this in mind, we propose a novel stylized image captioning approach that generates stylized sentences in a multi-pass decoding process by training three cooperative neural modules under the reinforcement learning paradigm. A low-level neural module, called the syntax module, first generates the overall syntactic structure of the stylized sentence. Next, two high-level neural modules, namely the concept module and the style module, incorporate the words that describe factual content and the words that express linguistic style, respectively. Since the three modules contribute to different aspects of the stylized sentence, i.e., fluency, relevancy of the factual content, and style accuracy, we encourage each module to specialize in its own task by designing different rewards for different actions. We also design an attention mechanism to facilitate communication between the high-level and low-level modules. With the help of this attention mechanism, the high-level modules can take the global structure of the sentence into consideration and maintain consistency between the factual content and the linguistic style. Evaluations on several public benchmark datasets demonstrate that our method outperforms existing one-pass decoding methods across multiple evaluation metrics.
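To make the multi-pass decoding idea concrete, the following sketch (a minimal illustration, not the authors' implementation) shows how a low-level syntax module might first draft a sentence skeleton from an image feature, and how a high-level filler module, standing in for either the concept module or the style module, might attend over that skeleton to insert factual or stylistic words. The class names, the GRU and multi-head attention choices, the slot handling, and all shapes are assumptions made for illustration; the paper's actual architecture and reward design may differ.

```python
# Illustrative sketch only (not the authors' code): a low-level module drafts a
# sentence skeleton, and high-level modules attend over it to fill in words.
import torch
import torch.nn as nn


class SyntaxModule(nn.Module):
    """Low-level module: drafts a syntactic skeleton conditioned on the image."""

    def __init__(self, vocab_size, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens, image_feat):
        h0 = image_feat.unsqueeze(0)                  # (1, B, H) initial state
        states, _ = self.rnn(self.embed(tokens), h0)  # (B, T, H)
        return self.out(states), states               # skeleton logits and states


class FillerModule(nn.Module):
    """High-level module (concept or style): attends over the skeleton states so
    that each filled word is chosen with the global sentence structure in view."""

    def __init__(self, vocab_size, hidden_size=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, slot_states, skeleton_states):
        ctx, _ = self.attn(slot_states, skeleton_states, skeleton_states)
        return self.out(ctx)                          # logits for the filled slots


if __name__ == "__main__":
    # Toy shapes: batch of 2, skeleton of 12 tokens, hypothetical vocabulary of 1000.
    B, T, H, V = 2, 12, 512, 1000
    image_feat = torch.randn(B, H)
    skeleton_tokens = torch.randint(0, V, (B, T))

    syntax = SyntaxModule(V, H)
    concept, style = FillerModule(V, H), FillerModule(V, H)

    skeleton_logits, skeleton_states = syntax(skeleton_tokens, image_feat)
    # For brevity every position is treated as a slot; a real system would route
    # factual slots to the concept module and stylistic slots to the style module.
    concept_logits = concept(skeleton_states, skeleton_states)
    style_logits = style(skeleton_states, skeleton_states)
    print(skeleton_logits.shape, concept_logits.shape, style_logits.shape)
```

Under the reinforcement learning setup described above, each module would then receive its own reward, for example a fluency score for the syntax module, a visual relevancy score for the concept module, and a style-classifier score for the style module, and would be updated with a policy-gradient (REINFORCE-style) loss.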





Author information


Corresponding author

Correspondence to Xinxiao Wu.

Additional information

Communicated by Svetlana Lazebnik.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported in part by the Natural Science Foundation of China (NSFC) under Grant No 62072041.


About this article


Cite this article

Wu, X., Zhao, W. & Luo, J. Learning Cooperative Neural Modules for Stylized Image Captioning. Int J Comput Vis 130, 2305–2320 (2022). https://doi.org/10.1007/s11263-022-01636-2

