Deep Learning and Shared Representation Space Learning Based Cross-Modal Multimedia Retrieval

Zou, Hui; Du, Ji-Xiang; Zhai, Chuan-Min; Wang, Jing

doi:10.1007/978-3-319-42294-7_28

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9772)

Included in the following conference series:

International Conference on Intelligent Computing

Abstract

An increasing amount of multimedia information of different kinds, including text, voice, video and images, is used together on the Internet to describe the same semantic concept. This paper presents a new method for more efficient cross-modal multimedia retrieval. Using images and text as an example, we learn deep features of images with convolutional neural networks and learn text features with a latent Dirichlet allocation model. We then map the two feature spaces into a shared representation space with a probability model so that they are isomorphic. Finally, we adopt centered correlation to measure the distance between them. Experimental results on the Wikipedia dataset show that our approach achieves state-of-the-art results.
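The final matching step can be illustrated with a small sketch. The abstract does not give the exact formula, so this assumes that centered correlation means the cosine similarity of mean-centered vectors (a Pearson-style correlation) and that each item in the shared space is a vector of class probabilities. The function names and the toy vectors are hypothetical and are not taken from the paper.

```python
import numpy as np


def centered_correlation(u, v):
    """Cosine similarity of the mean-centered vectors (assumed definition)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    uc = u - u.mean()
    vc = v - v.mean()
    denom = np.linalg.norm(uc) * np.linalg.norm(vc)
    if denom == 0.0:
        # A constant vector carries no ranking information; treat as uncorrelated.
        return 0.0
    return float(uc @ vc / denom)


def rank_by_centered_correlation(query, candidates):
    """Return candidate indices sorted from most to least correlated, plus scores."""
    scores = np.array([centered_correlation(query, c) for c in candidates])
    order = np.argsort(-scores, kind="stable")
    return order, scores


# Hypothetical shared-space vectors: probabilities over three semantic classes.
image_query = [0.7, 0.2, 0.1]
text_candidates = [
    [0.1, 0.1, 0.8],
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],
]

order, scores = rank_by_centered_correlation(image_query, text_candidates)
print(order.tolist(), np.round(scores, 3).tolist())
```

In this toy case the second text is ranked first because its class distribution rises and falls with the query's. A distance can be obtained as one minus the correlation if a smaller-is-closer measure is preferred.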

Acknowledgement

This work was supported by the Grant of the National Science Foundation of China (No. 61175121, 61502183), the Grant of the National Science Foundation of Fujian Province (No. 2013J06014), the Promotion Program for Young and Middle-aged Teacher in Science and Technology Research of Huaqiao University (No. ZQN-YX108), the Scientific Research Funds of Huaqiao University (No. 600005-Z15Y0016), and Subsidized Project for Cultivating Postgraduates’ Innovative Ability in Scientific Research of Huaqiao University (Nos. 1400214009, 1400214003).

Author information

Authors and Affiliations

Department of Computer Science and Technology, Huaqiao University, Xiamen, 361021, China
Hui Zou, Ji-Xiang Du, Chuan-Min Zhai & Jing Wang

Corresponding author

Correspondence to Ji-Xiang Du.

Editor information

Editors and Affiliations

Tongji University, Shanghai, China
De-Shuang Huang
University of Ulsan, Ulsan, Korea (Republic of)
Kang-Hyun Jo

Cite this paper

Zou, H., Du, JX., Zhai, CM., Wang, J. (2016). Deep Learning and Shared Representation Space Learning Based Cross-Modal Multimedia Retrieval. In: Huang, DS., Jo, KH. (eds) Intelligent Computing Theories and Application. ICIC 2016. Lecture Notes in Computer Science, vol 9772. Springer, Cham. https://doi.org/10.1007/978-3-319-42294-7_28

DOI: https://doi.org/10.1007/978-3-319-42294-7_28
Published: 12 July 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-42293-0
Online ISBN: 978-3-319-42294-7
eBook Packages: Computer Science, Computer Science (R0)
