Relational Visual-Textual Information Retrieval

Messina, Nicola

doi:10.1007/978-3-030-60936-8_33

Nicola Messina (ORCID: orcid.org/0000-0003-3011-2487)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12440)

Included in the following conference series:

International Conference on Similarity Search and Applications

Abstract

With the advent of deep learning, multimedia information processing gained a huge boost, and astonishing results have been observed on a multitude of interesting visual-textual tasks. Relation networks paved the way towards an attentive processing methodology that treats images and texts as sets of basic interconnected elements (regions and words). These ideas recently helped to reach the state of the art on the image-text matching task. Cross-media information retrieval has been proposed as a benchmark to test how well these networks can match complex multi-modal concepts in a common space. However, modern deep-learning-powered networks are complex, and almost none of them can provide concise multi-modal descriptions that can be used in fast multi-modal search engines. In fact, the latest image-sentence matching networks use cross-attention and early-fusion approaches, which force all the elements of the database to be considered at query time. In this work, I will try to lay down some ideas to bridge the gap between the effectiveness of modern deep-learning multi-modal matching architectures and their efficiency, as far as fast and scalable visual-textual information retrieval is concerned.
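
To make the efficiency argument concrete, the following is a minimal sketch of retrieval in a common space, under stated assumptions. It is not the method proposed in the chapter: the function names (`l2_normalize`, `build_index`, `search`) and the three-dimensional toy vectors are hypothetical stand-ins for the outputs of separate image and text encoders. The point it illustrates is that, when each item is described by one fixed-size vector computed offline, answering a query only needs the query embedding and cheap similarity computations.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def build_index(item_embeddings):
    """Offline step: normalize and store one fixed-size vector per database item."""
    return {item_id: l2_normalize(vec) for item_id, vec in item_embeddings.items()}

def search(index, query_embedding, k=2):
    """Online step: normalize the query once, then rank items by dot product."""
    q = l2_normalize(query_embedding)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), item_id)
        for item_id, vec in index.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]

# Hypothetical toy embeddings; a real system would obtain them from trained encoders.
image_embeddings = {
    "img_dog": [0.9, 0.1, 0.0],
    "img_cat": [0.1, 0.9, 0.0],
    "img_car": [0.0, 0.1, 0.9],
}
index = build_index(image_embeddings)

caption_embedding = [0.8, 0.2, 0.1]  # assumed text-encoder output for some caption
top = search(index, caption_embedding, k=2)
print(top)
```

The linear scan in `search` is used only to keep the sketch short; because the item vectors do not depend on the query, it could in principle be replaced by an approximate nearest-neighbor index. A cross-attention or early-fusion model offers no such precomputed vectors, since its score for an item is only defined once the query is known, so the network would have to be run on every query-item pair.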

Notes

1.
http://press.liacs.nl/mirflickr/.

Author information

Authors and Affiliations

Institute of Information Science and Technologies, National Research Council, Pisa, Italy
Nicola Messina

Corresponding author

Correspondence to Nicola Messina.

Editor information

Editors and Affiliations

National Institute of Informatics, Tokyo, Japan
Shin'ichi Satoh
ISTI-CNR, Pisa, Italy
Lucia Vadicamo
University of Southern Denmark, Odense M, Denmark
Arthur Zimek
ISTI-CNR, Pisa, Italy
Fabio Carrara
University of Bologna, Bologna, Italy
Ilaria Bartolini
IT University of Copenhagen, Copenhagen, Denmark
Martin Aumüller
IT University of Copenhagen, Copenhagen, Denmark
Björn Þór Jónsson
IT University of Copenhagen, Copenhagen, Denmark
Rasmus Pagh

About this paper

Cite this paper

Messina, N. (2020). Relational Visual-Textual Information Retrieval. In: Satoh, S., et al. Similarity Search and Applications. SISAP 2020. Lecture Notes in Computer Science, vol 12440. Springer, Cham. https://doi.org/10.1007/978-3-030-60936-8_33

DOI: https://doi.org/10.1007/978-3-030-60936-8_33
Published: 14 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60935-1
Online ISBN: 978-3-030-60936-8
eBook Packages: Computer Science, Computer Science (R0)
