Question action relevance and editing for visual question answering

Multimedia Tools and Applications

Abstract

Visual Question Answering (VQA) expands on the Turing Test, as it involves the ability to answer questions about visual content. Current efforts in VQA, however, still do not fully consider whether a question about visual content is relevant and, if it is not, how best to edit it to make it answerable. Question relevance has so far been considered only at the level of the whole question, using binary classification and without the capability to edit a question to make it grounded and intelligible. The only exception is our prior research effort on question part relevance, which determines relevance and supports editing based on object nouns. This paper extends that work on object relevance to determine the relevance of a question action and leverages this capability to edit an irrelevant question to make it relevant. Practical applications of such a capability include answering biometric-related queries across a set of images, including people and their actions (behavioral biometrics). The feasibility of our approach is shown using Context-Collaborative VQA (C2VQA) Action/Relevance/Edit (ARE). Our results show that the proposed approach outperforms all other models on the novel tasks of question action relevance (QAR) and question action editing (QAE) by a significant margin. The ultimate goal for future research is to address full-fledged W5+ inquiries (What, Where, When, Why, Who, and How) that are grounded to and reference video, using both nouns and verbs, in a collaborative, context-aware fashion.
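To make the QAR and QAE tasks concrete, the following is a minimal sketch in PyTorch of how a relevance/edit model might be wired together. It assumes pooled CNN image features and pretrained word embeddings for the question; all module names, dimensions, the two-way relevance head, and the verb-vocabulary edit head are illustrative assumptions, not the published C2VQA-ARE architecture.

import torch
import torch.nn as nn

class ActionRelevanceEditor(nn.Module):
    """Sketch of joint question action relevance (QAR) and editing (QAE).
    Layer sizes, names, and the scoring scheme are assumptions for illustration."""
    def __init__(self, img_dim=2048, word_dim=300, hidden=512, verb_vocab=1000):
        super().__init__()
        # Encode the question word embeddings with an LSTM.
        self.question_lstm = nn.LSTM(word_dim, hidden, batch_first=True)
        # Fuse the question encoding with pooled image features.
        self.fuse = nn.Linear(img_dim + hidden, hidden)
        # QAR head: binary decision, is the question's action relevant to the image?
        self.relevance = nn.Linear(hidden, 2)
        # QAE head: score candidate replacement verbs over a verb vocabulary.
        self.edit = nn.Linear(hidden, verb_vocab)

    def forward(self, img_feat, question_emb):
        # img_feat: (batch, img_dim) pooled CNN features, e.g. from a ResNet
        # question_emb: (batch, seq_len, word_dim) pretrained word vectors
        _, (h_n, _) = self.question_lstm(question_emb)
        joint = torch.tanh(self.fuse(torch.cat([img_feat, h_n[-1]], dim=1)))
        return self.relevance(joint), self.edit(joint)

# Usage: classify relevance; when the action is judged irrelevant,
# take the top-scoring verb as the proposed edit.
model = ActionRelevanceEditor()
img_feat = torch.randn(1, 2048)
question_emb = torch.randn(1, 12, 300)
rel_logits, verb_scores = model(img_feat, question_emb)
if rel_logits.argmax(dim=1).item() == 0:  # assumed convention: class 0 = "not relevant"
    replacement_verb_id = verb_scores.argmax(dim=1).item()

In practice the relevance head would be trained on questions labeled as relevant or irrelevant for a given image, and the edit head on the verb that makes an irrelevant question answerable; both heads are shown here only to illustrate the two tasks, not the reported model.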

Acknowledgements

We appreciate assistance from George Mason University, which provided access to GPU-based servers. These experiments were run on ARGO, a research computing cluster provided by the Office of Research Computing at George Mason University, VA (URL: http://orc.gmu.edu).

Author information

Correspondence to Andeep S. Toor.

About this article

Cite this article

Toor, A.S., Wechsler, H. & Nappi, M. Question action relevance and editing for visual question answering. Multimed Tools Appl 78, 2921–2935 (2019). https://doi.org/10.1007/s11042-018-6097-z
