Abstract
Image-text retrieval is a challenging task in the field of vision and language. Existing methods mainly compute the similarity of an image-text pair by aligning image regions with text words. Although these methods, which rely on fine-grained local features, achieve good results, they explore only the correspondence between salient objects and ignore the deeper semantic information expressed by the image and the text as a whole. We therefore propose a novel multi-level local alignment and semantic matching network (MLASM) that introduces a multi-level semantic matching module after local alignment. This module supplies the model with richer semantic information for understanding the complex correlations between images and texts. Experimental results on two benchmark datasets, Flickr30K and MS-COCO, show that MLASM achieves state-of-the-art performance.
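To make the idea of region-word alignment concrete, the following is a minimal illustrative sketch and not the MLASM architecture itself. It assumes each image region and each word is already encoded as a plain feature vector, scores local alignment by matching every word to its most similar region, and uses mean-pooled features as a simple stand-in for a whole-image and whole-sentence semantic signal. All function names and the `weight` parameter are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def local_alignment_score(regions, words):
    """Average, over words, of the similarity to the best-matching region.

    Assumes both lists are non-empty.
    """
    return sum(max(cosine(r, w) for r in regions) for w in words) / len(words)

def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def global_score(regions, words):
    """Similarity of the pooled image vector and the pooled text vector.

    Assumption: mean pooling stands in for a learned global representation.
    """
    return cosine(mean_pool(regions), mean_pool(words))

def combined_score(regions, words, weight=0.5):
    """Weighted sum of the local and global scores (weight is a hypothetical knob)."""
    local = local_alignment_score(regions, words)
    return weight * local + (1.0 - weight) * global_score(regions, words)

# Toy example: two 2-D region features and a one-word caption.
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0]]
score = combined_score(regions, words)
```

In this toy case the single word matches the first region exactly, so the local score is at its maximum, while the pooled image vector also mixes in the unrelated second region and the global score is lower. A purely local score cannot see that difference, which is the kind of gap the abstract attributes to alignment-only methods.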
Acknowledgement
This work is supported by the National Key R&D Program of China (2021YFF0602104-2).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Jiang, Z., Lian, Z. (2022). Mutil-level Local Alignment and Semantic Matching Network for Image-Text Retrieval. In: Pimenidis, E., Angelov, P., Jayne, C., Papaleonidas, A., Aydin, M. (eds) Artificial Neural Networks and Machine Learning – ICANN 2022. ICANN 2022. Lecture Notes in Computer Science, vol 13531. Springer, Cham. https://doi.org/10.1007/978-3-031-15934-3_18
DOI: https://doi.org/10.1007/978-3-031-15934-3_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15933-6
Online ISBN: 978-3-031-15934-3
eBook Packages: Computer Science, Computer Science (R0)