
Collaborative Detection and Caption Network

  • Conference paper
Advances in Multimedia Information Processing – PCM 2018 (PCM 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11164)


Abstract

Recently it has been shown that deep recurrent neural networks can be used to train video captioning systems. However, existing approaches are often confounded by ambiguity among videos, which frequently leads to grammatically correct but less relevant descriptions. In this paper, we propose an effective end-to-end network, called the Collaborative Detection and Caption network, which combines a video caption network as a video-to-sentence sub-network with a principal syntactic component detector as a video-to-words sub-network. The detector and the caption network couple spatially correlated attributes with a temporal attention model and are optimized jointly so that each facilitates the other. Experiments on the YouTube2Text, MPII Movie Description, and M-VAD datasets consistently show that the proposed network generates the crucial content needed to describe videos and thus improves caption and detection performance simultaneously. Moreover, the metric scores reported on these benchmarks outperform those of state-of-the-art methods.
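For readers who want a concrete picture of the two-branch design sketched in the abstract, the following is a minimal PyTorch-style illustration, not the authors' implementation: the feature dimensions, the additive temporal attention, and the equally weighted joint loss are assumptions chosen only to show how a video-to-sentence decoder and a video-to-words detector can share frame features and be optimized together.

```python
# Minimal sketch (assumed structure, not the paper's code) of a shared video
# encoder output feeding (i) a temporal-attention caption decoder and
# (ii) an attribute/word detector, trained with a joint loss.
import torch
import torch.nn as nn


class CollaborativeSketch(nn.Module):
    """Two-branch sketch: temporal-attention caption decoder + attribute detector."""

    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, num_attrs=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)       # temporal attention scores
        self.decoder = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
        self.word_out = nn.Linear(hidden_dim, vocab_size)     # video-to-sentence branch
        self.detector = nn.Linear(feat_dim, num_attrs)        # video-to-words branch

    def forward(self, feats, captions):
        # feats: (B, T, feat_dim) per-frame features; captions: (B, L) token ids
        B, T, _ = feats.shape
        h = feats.new_zeros(B, self.decoder.hidden_size)
        c = feats.new_zeros(B, self.decoder.hidden_size)
        step_logits = []
        for t in range(captions.size(1)):
            # attend over frames, conditioned on the current decoder state
            query = h.unsqueeze(1).expand(-1, T, -1)
            alpha = torch.softmax(self.attn(torch.cat([feats, query], dim=-1)), dim=1)
            ctx = (alpha * feats).sum(dim=1)                   # (B, feat_dim)
            h, c = self.decoder(torch.cat([ctx, self.embed(captions[:, t])], dim=-1), (h, c))
            step_logits.append(self.word_out(h))
        caption_logits = torch.stack(step_logits, dim=1)       # (B, L, vocab_size)
        attr_logits = self.detector(feats.mean(dim=1))         # (B, num_attrs)
        return caption_logits, attr_logits


def joint_loss(caption_logits, next_tokens, attr_logits, attr_labels, lam=1.0):
    """Caption cross-entropy plus multi-label detection loss, summed so both branches train jointly."""
    ce = nn.functional.cross_entropy(caption_logits.flatten(0, 1), next_tokens.flatten())
    det = nn.functional.binary_cross_entropy_with_logits(attr_logits, attr_labels.float())
    return ce + lam * det
```

The only coupling shown here is the shared frame features and the summed objective; the collaborative interaction between detector outputs and the caption decoder described in the paper is richer than this sketch.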


Notes

  1. We use the Stanford Parser [12] to parse each caption and take the subject noun, main verb, and object from the parse as that caption's subject-verb-object triplet (see the illustrative sketch following these notes).

  2. https://github.com/tylin/coco-caption.
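Note 1 above extracts a subject-verb-object triplet from each training caption. A minimal illustrative sketch follows; it substitutes spaCy for the Stanford Parser [12] used in the paper (an assumption made only because spaCy exposes dependency labels in a few lines), so the labels and model name below should not be read as the authors' setup.

```python
# Illustrative SVO extraction from a caption using spaCy as a stand-in parser.
# Requires the small English model: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")


def svo_triplet(caption):
    """Return (subject, verb, object) lemmas for a caption; missing parts stay None."""
    doc = nlp(caption)
    subj = verb = obj = None
    for tok in doc:
        if tok.dep_ == "ROOT" and tok.pos_ == "VERB":
            verb = tok.lemma_                 # main verb of the caption
        elif tok.dep_ in ("nsubj", "nsubjpass"):
            subj = tok.lemma_                 # subject noun
        elif tok.dep_ in ("dobj", "obj"):
            obj = tok.lemma_                  # direct object
    return subj, verb, obj


print(svo_triplet("A man is playing the guitar"))  # expected along the lines of ('man', 'play', 'guitar')
```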

References

  1. Chen, X., et al.: Microsoft COCO captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015)

  2. Chen, X., Zitnick, C.L.: Mind’s eye: a recurrent visual representation for image caption generation, pp. 2422–2431 (2014)


  3. Cho, K., Courville, A., Bengio, Y.: Describing multimedia content using attention-based encoder-decoder networks. IEEE Trans. Multimed. 17(11), 1875–1886 (2015)


  4. Denkowski, M., Lavie, A.: Meteor universal: Language specific translation evaluation for any target language. In: The Workshop on Statistical Machine Translation, pp. 376–380 (2014)


  5. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: C3D: generic features for video analysis. arXiv preprint (2014)


  6. Fang, H., et al: From captions to visual concepts and back. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1473–1482 (2015). https://doi.org/10.1109/CVPR.2015.7298754

  7. Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: The Workshop on Text Summarization Branches Out, p. 10 (2004)


  8. Guadarrama, S., et al.: YouTube2Text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: IEEE International Conference on Computer Vision, pp. 2712–2719 (2013)


  9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition, pp. 770–778 (2015)


  10. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)


  11. Kaufman, D., Levi, G., Hassner, T., Wolf, L.: Temporal tessellation: a unified approach for video analysis (2017)


  12. Klein, D., Manning, C.D.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, pp. 423–430. Association for Computational Linguistics (2003)


  13. Pan, P., Xu, Z., Yang, Y., Wu, F., Zhuang, Y.: Hierarchical recurrent neural encoder for video representation with application to captioning, pp. 1029–1038 (2015)


  14. Pan, Y., Mei, T., Yao, T., Li, H., Rui, Y.: Jointly modeling embedding and translation to bridge video and language. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4594–4602, June 2016. https://doi.org/10.1109/CVPR.2016.497

  15. Pan, Y., Yao, T., Li, H., Mei, T.: Video captioning with transferred semantic attributes (2016)


  16. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Meeting on Association for Computational Linguistics, pp. 311–318 (2002)


  17. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: International Conference on Neural Information Processing Systems, pp. 91–99 (2015)


  18. Rohrbach, A., Rohrbach, M., Tandon, N., Schiele, B.: A dataset for movie description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)


  19. Rohrbach, A., et al.: Movie description. Int. J. Comput. Vis. (2017). http://resources.mpi-inf.mpg.de/publications/D1/2016/2310198.pdf

  20. Song, J., Guo, Z., Gao, L., Liu, W., Zhang, D., Shen, H.T.: Hierarchical LSTM with adjusted temporal attention for video captioning (2017)


  21. Torabi, A., Pal, C., Larochelle, H., Courville, A.: Using descriptive video services to create a large data source for video annotation research. arXiv preprint (2015)


  22. Uijlings, J.R., Van De Sande, K.E., Gevers, T., Smeulders, A.W.: Selective search for object recognition. Int. J. Comput. Vis. 104(2), 154–171 (2013)


  23. Vedantam, R., Zitnick, C.L., Parikh, D.: CIDEr: consensus-based image description evaluation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575 (2015)


  24. Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., Saenko, K.: Sequence to sequence - video to text (2015)


  25. Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R.J., Saenko, K.: Translating videos to natural language using deep recurrent neural networks. CoRR abs/1412.4729 (2014). http://arxiv.org/abs/1412.4729

  26. Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057 (2015)


  27. Yao, L., et al.: Describing videos by exploiting temporal structure. In: IEEE International Conference on Computer Vision, pp. 4507–4515 (2015)


  28. You, Q., Jin, H., Wang, Z., Fang, C., Luo, J.: Image captioning with semantic attention, pp. 4651–4659 (2016)


  29. Yu, H., Wang, J., Huang, Z., Yang, Y., Xu, W.: Video paragraph captioning using hierarchical recurrent neural networks. In: Computer Vision and Pattern Recognition, pp. 4584–4593 (2016)


  30. Yu, Y., Ko, H., Choi, J., Kim, G.: Video captioning and retrieval models with semantic attention. arXiv preprint arXiv:1610.02947 (2016)


Acknowledgment

This work was supported by the National Natural Science Foundation of China (NSFC) under Grants 61622211, 61472392, 61751304 and 61620106009.

Author information

Corresponding author

Correspondence to Jiang Zhang.



Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, T., Zhang, J., Zha, ZJ. (2018). Collaborative Detection and Caption Network. In: Hong, R., Cheng, WH., Yamasaki, T., Wang, M., Ngo, CW. (eds) Advances in Multimedia Information Processing – PCM 2018. PCM 2018. Lecture Notes in Computer Science, vol. 11164. Springer, Cham. https://doi.org/10.1007/978-3-030-00776-8_10


  • DOI: https://doi.org/10.1007/978-3-030-00776-8_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00775-1

  • Online ISBN: 978-3-030-00776-8

  • eBook Packages: Computer Science; Computer Science (R0)
