Mutil-level Local Alignment and Semantic Matching Network for Image-Text Retrieval

Jiang, Zhukai; Lian, Zhichao

doi:10.1007/978-3-031-15934-3_18

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13531)

Included in the following conference series:

International Conference on Artificial Neural Networks

Abstract

Image-text retrieval is a challenging task in the field of vision and language. Existing methods mainly compute the similarity of an image-text pair by aligning image regions with text words. Although these methods, which rely on fine-grained local features, achieve good results, they only explore the correspondence between salient objects and ignore the deep semantic information expressed by the image and the text as a whole. We therefore propose a novel multi-level local alignment and semantic matching network (MLASM) that introduces a multi-level semantic matching module after local alignment. This module supplies the model with richer semantic information for understanding the complex correlations between images and texts. Experimental results on two benchmark datasets, Flickr30K and MS-COCO, show that MLASM achieves state-of-the-art performance.
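
The region-word alignment that the abstract attributes to existing methods can be illustrated with a minimal sketch. This is a generic illustration only and not the MLASM model: the function names, the toy feature vectors, and the pooling choice (best-matching region per word, then the mean over words) are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def local_alignment_score(regions, words):
    """Mean, over words, of each word's similarity to its best-matching region.

    The max-then-mean pooling is an assumed, simplified choice for illustration.
    """
    if not regions or not words:
        raise ValueError("need at least one region and one word")
    best = [max(cosine(r, w) for r in regions) for w in words]
    return sum(best) / len(best)

# Toy 2-D features (hypothetical values, for illustration only).
regions = [[1.0, 0.0], [0.0, 1.0]]
matched_words = [[1.0, 0.0], [0.0, 2.0]]
mismatched_words = [[-1.0, 0.0]]

matched = local_alignment_score(regions, matched_words)
mismatched = local_alignment_score(regions, mismatched_words)
```

In this toy setting the caption whose words each point in the direction of some region scores higher than the caption whose word points away from every region. A score of this kind captures only local correspondences, which is the limitation the abstract says the semantic matching module is meant to address.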

Acknowledgement

This work is supported by the National Key R&D Program of China (2021YFF0602104-2).

Author information

Authors and Affiliations

Nanjing University of Science and Technology, Nanjing, China
Zhukai Jiang & Zhichao Lian

Corresponding author

Correspondence to Zhichao Lian.

Editor information

Editors and Affiliations

University of the West of England, Bristol, UK
Elias Pimenidis
Lancaster University, Lancaster, UK
Plamen Angelov
Digital Innovation, Teesside University, Middlesbrough, UK
Chrisina Jayne
Democritus University of Thrace, Xanthi, Greece
Antonios Papaleonidas
The University of the West of England, Bristol, UK
Mehmet Aydin

About this paper

Cite this paper

Jiang, Z., Lian, Z. (2022). Mutil-level Local Alignment and Semantic Matching Network for Image-Text Retrieval. In: Pimenidis, E., Angelov, P., Jayne, C., Papaleonidas, A., Aydin, M. (eds) Artificial Neural Networks and Machine Learning – ICANN 2022. ICANN 2022. Lecture Notes in Computer Science, vol 13531. Springer, Cham. https://doi.org/10.1007/978-3-031-15934-3_18

DOI: https://doi.org/10.1007/978-3-031-15934-3_18
Published: 15 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15933-6
Online ISBN: 978-3-031-15934-3
eBook Packages: Computer Science, Computer Science (R0)
