Abstract
In recent times, image caption generation has attracted considerable research attention. The present work attempts to address the problem of Hindi image caption generation using the Hindi Visual Genome dataset. Hindi is the official and most widely spoken language of India, and in a linguistically diverse country such as India it is essential to provide tools that help people understand visual content in their native language. In this paper, an encoder-decoder architecture is proposed in which a Convolutional Neural Network (CNN) encodes the visual features of an image and a stacked Long Short-Term Memory (sLSTM) network, combined with both uni-directional and bi-directional LSTMs, generates the captions in Hindi. A pre-trained VGG19 model is used to encode the visual feature representation of an image, and the sLSTM architecture serves as the decoder for caption generation. The model is evaluated on the Hindi Visual Genome dataset to validate the proposed approach's performance, and cross-verification for English captions is carried out on the Flickr dataset. The experimental results show that the proposed model is both qualitatively and quantitatively better than state-of-the-art approaches for Hindi caption generation.
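The decoding scheme described above — an image feature vector conditioning a stacked LSTM that emits one word at a time — can be sketched in minimal NumPy form. This is an illustrative toy, not the authors' implementation: the parameters are random and untrained, all dimensions are made up for the example, and the 512-dimensional `image_feat` merely stands in for a VGG19 fully-connected-layer feature.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm(input_dim, hidden_dim):
    """Random (untrained) LSTM parameters; one stacked matrix for all 4 gates."""
    scale = 0.1
    return {
        "W": rng.normal(0.0, scale, (4 * hidden_dim, input_dim)),
        "U": rng.normal(0.0, scale, (4 * hidden_dim, hidden_dim)),
        "b": np.zeros(4 * hidden_dim),
    }

def lstm_step(params, x, h, c):
    """One LSTM time step: input (i), forget (f), output (o) gates and candidate cell (g)."""
    H = h.shape[0]
    z = params["W"] @ x + params["U"] @ h + params["b"]
    i = sigmoid(z[0:H])
    f = sigmoid(z[H:2 * H])
    o = sigmoid(z[2 * H:3 * H])
    g = np.tanh(z[3 * H:4 * H])
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

def greedy_decode(image_feat, embed, W_out, layers, start_id, end_id, max_len=20):
    """Greedily decode a caption (list of word ids) from an image feature.

    The image feature initialises the hidden state of the first LSTM layer;
    each layer's hidden state feeds the next layer (stacked LSTM decoder)."""
    H = W_out.shape[1]
    states = [(np.zeros(H), np.zeros(H)) for _ in layers]
    states[0] = (image_feat[:H], np.zeros(H))  # crude projection: truncate feature
    token, caption = start_id, []
    for _ in range(max_len):
        x = embed[token]
        for k, params in enumerate(layers):
            h, c = lstm_step(params, x, *states[k])
            states[k] = (h, c)
            x = h  # output of layer k is the input of layer k + 1
        token = int(np.argmax(W_out @ x))  # greedy choice of next word id
        if token == end_id:
            break
        caption.append(token)
    return caption

# Toy dimensions: vocabulary of 50 word ids, 32-dim embeddings and hidden states.
vocab, emb_dim, hid = 50, 32, 32
embed = rng.normal(0.0, 0.1, (vocab, emb_dim))
W_out = rng.normal(0.0, 0.1, (vocab, hid))
layers = [init_lstm(emb_dim, hid), init_lstm(hid, hid)]  # two stacked LSTMs
image_feat = rng.normal(0.0, 0.1, 512)  # stand-in for a VGG19 feature vector
ids = greedy_decode(image_feat, embed, W_out, layers, start_id=0, end_id=1)
```

With random weights the emitted word ids are meaningless; in a trained model the same loop would map ids to Hindi vocabulary entries, and beam search is a common replacement for the greedy `argmax`.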
Acknowledgements
This work is supported by the Scheme for Promotion of Academic and Research Collaboration (SPARC), Project Code P995, No. SPARC/2018-2019/119/SL (IN), funded by the Ministry of Education (erstwhile MHRD), Government of India.
Appendix: List of symbols and acronyms
Cite this article
Singh, A., Singh, T.D. & Bandyopadhyay, S. An encoder-decoder based framework for Hindi image caption generation. Multimed Tools Appl 80, 35721–35740 (2021). https://doi.org/10.1007/s11042-021-11106-5