Abstract
Automatic image captioning aims to describe an image in natural language that humans can understand. Most existing approaches are holistic: they translate the whole image into a single sentence and therefore risk losing important aspects of the scene. To generate better and more detailed captions, we propose a dense captioning architecture that first extracts and describes the objects in the image and then uses these descriptions to generate a dense, detailed caption. The architecture has two modules: the first generates region descriptions that describe the objects and their relationships, while the second generates object attributes that supply object details. The two outputs are concatenated and passed to a sentence generation module, based on an encoder-decoder formulation, that produces a single meaningful, grammatically well-formed, and detailed sentence. Evaluated with standard metrics, the proposed architecture outperforms current state-of-the-art image captioning techniques such as Neural Talk and Show, Attend and Tell.
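The concatenation step described in the abstract can be illustrated with a minimal sketch. It only shows how the outputs of the two modules might be merged into one token sequence for the sentence generator; the function name `build_generator_input`, the `<sep>` token, and the example strings are assumptions made for illustration and do not come from the paper, which does not specify these details here.

```python
from typing import List

# Hypothetical separator token; the paper does not state how segments are delimited.
SEP = "<sep>"


def build_generator_input(region_descriptions: List[str],
                          object_attributes: List[str]) -> List[str]:
    """Concatenate region descriptions and object attributes into one token list."""
    tokens: List[str] = []
    for text in region_descriptions + object_attributes:
        words = text.lower().split()
        if not words:
            continue  # skip empty module outputs
        if tokens:
            tokens.append(SEP)  # mark the boundary between two segments
        tokens.extend(words)
    return tokens


# Illustrative inputs only (not taken from the paper's data).
regions = ["man riding a horse", "horse standing on grass"]
attributes = ["brown horse", "green grass"]
encoder_input = build_generator_input(regions, attributes)
print(encoder_input)
```

In a full system, a sequence like `encoder_input` would be mapped to vocabulary indices and fed to the encoder of the sentence generation module; that part is omitted because its details are not given in this section.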
References
Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72 (2005)
Grubinger, M., Clough, P., Müller, H., Deselaers, T.: The IAPR TC-12 benchmark: A new evaluation resource for visual information systems. In: International Workshop OntoImage, vol. 5, p. 10 (2006)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Jaderberg, M., Simonyan, K., Zisserman, A.: Spatial transformer networks. In: Advances in Neural Information Processing Systems, pp. 2017–2025 (2015)
Johnson, J., Karpathy, A., Fei-Fei, L.: Densecap: fully convolutional localization networks for dense captioning. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4565–4574 (2016)
Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3128–3137 (2015)
Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention (2015)
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Kolb, P.: DISCO: A multilingual database of distributionally similar words. In: Proceedings of KONVENS (2008)
Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3D object representations for fine-grained categorization. In: 2013 IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 554–561. IEEE (2013)
Krishna, R., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 123(1), 32–73 (2017)
Kulkarni, G., et al.: Baby talk: understanding and generating image descriptions. In: Proceedings of the 24th CVPR. Citeseer (2011)
Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out (2004)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Lu, J., Xiong, C., Parikh, D., Socher, R.: Knowing when to look: adaptive attention via a visual sentinel for image captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 6 (2017)
Manning, C.D.: Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In: Gelbukh, A.F. (ed.) CICLing 2011. LNCS, vol. 6608, pp. 171–189. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-19400-9_14
Nganji, J.T., Brayshaw, M., Tompsett, B.: Describing and assessing image descriptions for visually impaired web users with IDAT. In: Proceedings of the Third International Conference on Intelligent Human Computer Interaction (IHCI 2011), Prague, Czech Republic, August 2011, pp. 27–37. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-31603-6_3
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318. Association for Computational Linguistics (2002)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)
Shen, J., et al.: Unified structured learning for simultaneous human pose estimation and garment attribute classification. IEEE Trans. Image Process. 23(11), 4786–4798 (2014)
Trundle, S.S., McCarthy, R.J., Martin, J.P., Slavin, A.J., Hutz, D.J.: Image surveillance and reporting technology (2015)
Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156–3164. IEEE (2015)
Ye, C., Yang, Y., Mao, R., Fermüller, C., Aloimonos, Y.: What can I do around here? Deep functional scene understanding for cognitive robots. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 4604–4611. IEEE (2017)
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Khurram, I., Fraz, M.M., Shahzad, M. (2018). Detailed Sentence Generation Architecture for Image Semantics Description. In: Bebis, G., et al. Advances in Visual Computing. ISVC 2018. Lecture Notes in Computer Science, vol 11241. Springer, Cham. https://doi.org/10.1007/978-3-030-03801-4_37
DOI: https://doi.org/10.1007/978-3-030-03801-4_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03800-7
Online ISBN: 978-3-030-03801-4
eBook Packages: Computer Science, Computer Science (R0)