
Learning to Describe E-Commerce Images from Noisy Online Data

  • Conference paper
  • In: Computer Vision – ACCV 2016 (ACCV 2016)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 10115)


Abstract

Recent studies have shown successful results in generating proper language descriptions for a given image, where the focus is on detecting and describing contextual relationships in the image, such as the kind of object, the relationship between two objects, or an action. In this paper, we turn our attention to more subjective components of descriptions that contain rich expressions modifying objects, namely attribute expressions. We start by collecting a large number of product images from the online market site Etsy, and consider learning a language generation model using a popular combination of a convolutional neural network (CNN) and a recurrent neural network (RNN). Our Etsy dataset exhibits the unique noise characteristics that often arise in online markets. We first apply natural language processing techniques to extract high-quality, learnable examples from the real-world noisy data. We then learn a generation model from product images and their associated title descriptions, and examine how e-commerce-specific metadata and fine-tuning improve the generated expressions. The experimental results suggest that we are able to learn from the noisy online data and produce product descriptions that are closer to human-written descriptions, including possibly subjective attribute expressions.
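The abstract does not spell out the filtering rules used to extract learnable examples. As a rough, hypothetical illustration of what such pre-processing of noisy listing titles might look like, the sketch below applies simple heuristics (a word-count range, a promotional-keyword blacklist, and a minimum alphabetic-character ratio); the thresholds and keyword list are assumptions for illustration, not the paper's actual pipeline.

```python
# Hypothetical title-filtering heuristics, not the paper's actual method:
# keep titles that look like plain descriptive text and drop marketplace noise.
PROMO_WORDS = {"sale", "free shipping", "discount", "coupon"}

def is_learnable_title(title: str,
                       min_words: int = 3,
                       max_words: int = 15,
                       min_alpha_ratio: float = 0.7) -> bool:
    """Return True if a product title looks usable as a training caption."""
    text = title.strip().lower()
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False  # too short to learn from, or likely keyword-stuffed
    if any(p in text for p in PROMO_WORDS):
        return False  # promotional noise rather than a description
    alpha = sum(c.isalpha() or c.isspace() for c in text)
    if alpha / max(len(text), 1) < min_alpha_ratio:
        return False  # dominated by symbols/digits (SKUs, prices, decorations)
    return True

titles = [
    "hand knitted wool scarf in deep forest green",
    "SALE!!! $5 OFF *** FREE SHIPPING ***",
    "mug",
    "vintage brass candle holder with floral engraving",
]
kept = [t for t in titles if is_learnable_title(t)]
# kept retains only the two plainly descriptive titles
```

The surviving image-title pairs would then serve as training data for the CNN-RNN generation model described above.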


Notes

  1. http://vision.is.tohoku.ac.jp/~kyamagu/research/etsy-dataset.


Acknowledgement

This work was supported by JSPS KAKENHI Grant Numbers JP15H05919 and JP15H05318.

Author information

Correspondence to Takuya Yashima.



Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Yashima, T., Okazaki, N., Inui, K., Yamaguchi, K., Okatani, T. (2017). Learning to Describe E-Commerce Images from Noisy Online Data. In: Lai, SH., Lepetit, V., Nishino, K., Sato, Y. (eds) Computer Vision – ACCV 2016. ACCV 2016. Lecture Notes in Computer Science, vol 10115. Springer, Cham. https://doi.org/10.1007/978-3-319-54193-8_6


  • DOI: https://doi.org/10.1007/978-3-319-54193-8_6


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-54192-1

  • Online ISBN: 978-3-319-54193-8

  • eBook Packages: Computer Science, Computer Science (R0)
