Abstract
How to measure the distance between heterogeneous data remains an open problem. Many works learn a common subspace in which the similarity between different modalities can be computed directly. However, most existing approaches focus on learning a latent subspace without preserving the semantically structural information well, so they cannot achieve the desired results. In this paper, we propose a novel framework, termed Cross-modal subspace learning via Kernel correlation maximization and Discriminative structure-preserving (CKD), which addresses this problem in two ways. First, we construct a shared semantic graph so that the data of each modality preserve the semantic neighborhood relationships. Second, we introduce the Hilbert-Schmidt Independence Criterion (HSIC) to ensure consistency between the feature similarity and the semantic similarity of samples. Our model not only captures the inter-modality correlation by maximizing the kernel correlation but also preserves the semantically structural information within each modality. Extensive experiments on three public datasets demonstrate that the proposed CKD is competitive with classic subspace learning methods.
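As a minimal sketch of the HSIC ingredient mentioned above: the widely used biased empirical estimator of HSIC between two kernel matrices is tr(KHLH)/(n-1)^2, where H is the centering matrix. The snippet below (illustrative only; the toy features and the linear kernels are assumptions, not the paper's actual setup) shows how HSIC is large for dependent modalities and near zero for independent ones.

```python
import numpy as np

def hsic(K, L):
    """Biased empirical HSIC estimator: tr(K H L H) / (n - 1)^2."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))        # hypothetical "image" features
Y = X @ rng.standard_normal((10, 5))     # "text" features dependent on X
Z = rng.standard_normal((50, 5))         # features independent of X

dep = hsic(X @ X.T, Y @ Y.T)   # dependent modalities: large value
ind = hsic(X @ X.T, Z @ Z.T)   # independent modalities: near zero
```

Maximizing such a dependence measure between the learned feature kernel and a semantic (label) kernel is one standard way to enforce the feature-similarity/semantic-similarity consistency the abstract describes.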
Acknowledgments
This paper is supported by the National Natural Science Foundation of China (Grant Nos. 61672265 and U1836218) and the 111 Project of the Ministry of Education of China (Grant No. B12018).
Yu, J., Wu, XJ. Cross-modal subspace learning via kernel correlation maximization and discriminative structure-preserving. Multimed Tools Appl 79, 34647–34663 (2020). https://doi.org/10.1007/s11042-020-08989-1