Question action relevance and editing for visual question answering

Multimedia Tools and Applications

Abstract

Visual Question Answering (VQA) expands on the Turing Test, as it involves the ability to answer questions about visual content. Current efforts in VQA, however, still do not fully consider whether a question about visual content is relevant and, if it is not, how best to edit it to make it answerable. Question relevance has so far been considered only at the level of the whole question, using binary classification and without the capability to edit a question to make it grounded and intelligible. The only exception is our prior research effort on question part relevance, which determines relevance and supports editing based on object nouns. This paper extends that work on object relevance to determine the relevance of a question action and leverages this capability to edit an irrelevant question to make it relevant. Practical applications of such a capability include answering biometric-related queries across a set of images, including people and their actions (behavioral biometrics). The feasibility of our approach is shown using Context-Collaborative VQA (C2VQA) Action/Relevance/Edit (ARE). Our results show that the proposed approach outperforms all other models on the novel tasks of question action relevance (QAR) and question action editing (QAE) by a significant margin. The ultimate goal for future research is to address full-fledged W5+ inquiries (What, Where, When, Why, Who, and How) that are grounded to and reference video, using both nouns and verbs, in a collaborative, context-aware fashion.
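To make the QAR and QAE tasks concrete, the following is a minimal sketch in PyTorch of how a relevance/edit model might be wired together. It assumes pooled CNN image features and pretrained word embeddings for the question; all module names, dimensions, the two-way relevance head, and the verb-vocabulary edit head are illustrative assumptions, not the published C2VQA-ARE architecture.

import torch
import torch.nn as nn

class ActionRelevanceEditor(nn.Module):
    """Sketch of joint question action relevance (QAR) and editing (QAE).
    Layer sizes, names, and the scoring scheme are assumptions for illustration."""
    def __init__(self, img_dim=2048, word_dim=300, hidden=512, verb_vocab=1000):
        super().__init__()
        # Encode the question word embeddings with an LSTM.
        self.question_lstm = nn.LSTM(word_dim, hidden, batch_first=True)
        # Fuse the question encoding with pooled image features.
        self.fuse = nn.Linear(img_dim + hidden, hidden)
        # QAR head: binary decision, is the question's action relevant to the image?
        self.relevance = nn.Linear(hidden, 2)
        # QAE head: score candidate replacement verbs over a verb vocabulary.
        self.edit = nn.Linear(hidden, verb_vocab)

    def forward(self, img_feat, question_emb):
        # img_feat: (batch, img_dim) pooled CNN features, e.g. from a ResNet
        # question_emb: (batch, seq_len, word_dim) pretrained word vectors
        _, (h_n, _) = self.question_lstm(question_emb)
        joint = torch.tanh(self.fuse(torch.cat([img_feat, h_n[-1]], dim=1)))
        return self.relevance(joint), self.edit(joint)

# Usage: classify relevance; when the action is judged irrelevant,
# take the top-scoring verb as the proposed edit.
model = ActionRelevanceEditor()
img_feat = torch.randn(1, 2048)
question_emb = torch.randn(1, 12, 300)
rel_logits, verb_scores = model(img_feat, question_emb)
if rel_logits.argmax(dim=1).item() == 0:  # assumed convention: class 0 = "not relevant"
    replacement_verb_id = verb_scores.argmax(dim=1).item()

In practice the relevance head would be trained on questions labeled as relevant or irrelevant for a given image, and the edit head on the verb that makes an irrelevant question answerable; both heads are shown here only to illustrate the two tasks, not the reported model.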

Acknowledgements

We appreciate assistance from George Mason University, which provided access to GPU-based servers. These experiments were run on ARGO, a research computing cluster provided by the Office of Research Computing at George Mason University, VA (URL: http://orc.gmu.edu).

Author information

Correspondence to Andeep S. Toor.

About this article

Cite this article

Toor, A.S., Wechsler, H. & Nappi, M. Question action relevance and editing for visual question answering. Multimed Tools Appl 78, 2921–2935 (2019). https://doi.org/10.1007/s11042-018-6097-z
