Abstract
In this paper, we consider multi-modal retrieval from the perspective of deep textual-visual learning, so as to preserve the correlations between multi-modal data. Specifically, we propose a general multi-modal retrieval algorithm, called Deep Textual-Visual correlation learning (DTV), that maximizes the canonical correlations between multi-modal data via deep learning. In DTV, given pairs of images and their describing documents, a convolutional neural network learns visual representations of the images, while a dependency-tree recursive neural network (DT-RNN) learns compositional textual representations of the documents. DTV then projects the visual and textual representations into a common embedding space where, via matrix-vector canonical correlation analysis (CCA), each pair of multi-modal data is maximally correlated subject to being uncorrelated with other pairs. Experimental results demonstrate the effectiveness of the proposed DTV when applied to multi-modal retrieval.
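The projection step described above can be illustrated with a minimal linear-CCA sketch in NumPy. This is an assumption for illustration only: the paper's matrix-vector CCA variant and the CNN/DT-RNN feature extractors are not reproduced here, and the function name `cca_projections` is hypothetical. The sketch learns one projection per modality such that paired samples are maximally correlated in the shared space.

```python
import numpy as np

def cca_projections(X, Y, k=2, reg=1e-6):
    """Linear CCA for two paired feature sets.

    X: (n, dx) visual features; Y: (n, dy) textual features (paired rows).
    Returns Wx (dx, k), Wy (dy, k) projecting into a shared k-dim space,
    plus the top-k canonical correlations.
    """
    # Center each view.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized covariance and cross-covariance matrices.
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)
    # Whiten each view via the inverse Cholesky factor; the SVD of the
    # whitened cross-covariance then yields the canonical directions,
    # with singular values equal to the canonical correlations.
    inv_sqrt = lambda C: np.linalg.inv(np.linalg.cholesky(C)).T
    Rx, Ry = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Rx.T @ Cxy @ Ry)
    Wx = Rx @ U[:, :k]
    Wy = Ry @ Vt.T[:, :k]
    return Wx, Wy, s[:k]
```

In a retrieval setting, both modalities are mapped through their respective projections and nearest neighbors are found in the shared space, which is the role the CCA stage plays in DTV.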
Acknowledgments
This work is supported in part by 973 Program (2012CB316400), NSFC (61402401), Zhejiang Provincial Natural Science Foundation of China (LQ14F010004), Chinese Knowledge Center of Engineering Science and Technology (CKCEST).
Copyright information
© 2015 Springer International Publishing Switzerland
Cite this paper
Song, J., Wang, Y., Wu, F., Lu, W., Tang, S., Zhuang, Y. (2015). Multi-modal Retrieval via Deep Textual-Visual Correlation Learning. In: He, X., et al. Intelligence Science and Big Data Engineering. Image and Video Data Engineering. IScIDE 2015. Lecture Notes in Computer Science, vol. 9242. Springer, Cham. https://doi.org/10.1007/978-3-319-23989-7_19
DOI: https://doi.org/10.1007/978-3-319-23989-7_19
Print ISBN: 978-3-319-23987-3
Online ISBN: 978-3-319-23989-7