
Learning Cooperative Neural Modules for Stylized Image Captioning

International Journal of Computer Vision

Abstract

Recent progress in stylized image captioning has been achieved through the encoder-decoder framework, which generates a sentence in a one-pass decoding process. However, it remains difficult for such a decoding process to simultaneously capture the syntactic structure, infer the semantic concepts, and express the linguistic style. Research in psycholinguistics has revealed that human language production involves multiple stages, starting with several rough concepts and ending with fluent sentences. With this in mind, we propose a novel stylized image captioning approach that generates stylized sentences in a multi-pass decoding process by training three cooperative neural modules under the reinforcement learning paradigm. A low-level neural module, called the syntax module, first generates the overall syntactic structure of the stylized sentence. Next, two high-level neural modules, namely the concept module and the style module, incorporate the words that describe factual content and the words that express linguistic style, respectively. Since the three modules contribute to different aspects of the stylized sentence, i.e., fluency, relevancy of the factual content, and style accuracy, we encourage each module to specialize in its own task by designing different rewards for different actions. We also design an attention mechanism to facilitate communication between the high-level and low-level modules. With the help of this attention mechanism, the high-level modules can take the global structure of the sentence into consideration and maintain consistency between the factual content and the linguistic style. Evaluations on several public benchmark datasets demonstrate that our method outperforms existing one-pass decoding methods across multiple evaluation metrics.
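To make the multi-pass decoding idea concrete, the following sketch (a minimal illustration, not the authors' implementation) shows how a low-level syntax module might first draft a sentence skeleton from an image feature, and how a high-level filler module, standing in for either the concept module or the style module, might attend over that skeleton to insert factual or stylistic words. The class names, the GRU and multi-head attention choices, the slot handling, and all shapes are assumptions made for illustration; the paper's actual architecture and reward design may differ.

```python
# Illustrative sketch only (not the authors' code): a low-level module drafts a
# sentence skeleton, and high-level modules attend over it to fill in words.
import torch
import torch.nn as nn


class SyntaxModule(nn.Module):
    """Low-level module: drafts a syntactic skeleton conditioned on the image."""

    def __init__(self, vocab_size, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens, image_feat):
        h0 = image_feat.unsqueeze(0)                  # (1, B, H) initial state
        states, _ = self.rnn(self.embed(tokens), h0)  # (B, T, H)
        return self.out(states), states               # skeleton logits and states


class FillerModule(nn.Module):
    """High-level module (concept or style): attends over the skeleton states so
    that each filled word is chosen with the global sentence structure in view."""

    def __init__(self, vocab_size, hidden_size=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, slot_states, skeleton_states):
        ctx, _ = self.attn(slot_states, skeleton_states, skeleton_states)
        return self.out(ctx)                          # logits for the filled slots


if __name__ == "__main__":
    # Toy shapes: batch of 2, skeleton of 12 tokens, hypothetical vocabulary of 1000.
    B, T, H, V = 2, 12, 512, 1000
    image_feat = torch.randn(B, H)
    skeleton_tokens = torch.randint(0, V, (B, T))

    syntax = SyntaxModule(V, H)
    concept, style = FillerModule(V, H), FillerModule(V, H)

    skeleton_logits, skeleton_states = syntax(skeleton_tokens, image_feat)
    # For brevity every position is treated as a slot; a real system would route
    # factual slots to the concept module and stylistic slots to the style module.
    concept_logits = concept(skeleton_states, skeleton_states)
    style_logits = style(skeleton_states, skeleton_states)
    print(skeleton_logits.shape, concept_logits.shape, style_logits.shape)
```

Under the reinforcement learning setup described above, each module would then receive its own reward, for example a fluency score for the syntax module, a visual relevancy score for the concept module, and a style-classifier score for the style module, and would be updated with a policy-gradient (REINFORCE-style) loss.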





Author information


Corresponding author

Correspondence to Xinxiao Wu.

Additional information

Communicated by Svetlana Lazebnik.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported in part by the Natural Science Foundation of China (NSFC) under Grant No 62072041.


About this article


Cite this article

Wu, X., Zhao, W. & Luo, J. Learning Cooperative Neural Modules for Stylized Image Captioning. Int J Comput Vis 130, 2305–2320 (2022). https://doi.org/10.1007/s11263-022-01636-2

