
Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions

  • Conference paper
  • First Online:
Computer Vision – ECCV 2020 (ECCV 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12363)

Included in the following conference series: European Conference on Computer Vision

Abstract

To understand movies, humans constantly reason over the dialogues and actions shown in specific scenes and relate them to the overall storyline already seen. Inspired by this behaviour, we design ROLL, a model for knowledge-based video story question answering that leverages three crucial aspects of movie understanding: dialog comprehension, scene reasoning, and storyline recalling. In ROLL, each of these tasks is in charge of extracting rich and diverse information by 1) processing scene dialogues, 2) generating unsupervised video scene descriptions, and 3) obtaining external knowledge in a weakly supervised fashion. To answer a given question correctly, the information generated by each of these cognitively inspired tasks is encoded via Transformers and fused through a modality weighting mechanism, which balances the information from the different sources. Exhaustive evaluation demonstrates the effectiveness of our approach, which yields a new state-of-the-art on two challenging video question answering datasets: KnowIT VQA and TVQA+.
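
The fusion step described in the abstract can be pictured with a short sketch. Below is a minimal, illustrative PyTorch implementation of a modality weighting mechanism that scores candidate answers from each branch (dialog, scene descriptions, external knowledge) and combines those scores with learned, input-dependent weights. The class name, dimensions, and gating design are assumptions made for illustration; this is not the authors' released code.

```python
# Minimal sketch only: an illustrative modality-weighting fusion in PyTorch.
# Class name, dimensions, and the gating design are assumptions, not the
# authors' released implementation.
from typing import List

import torch
import torch.nn as nn


class ModalityWeightingFusion(nn.Module):
    """Scores candidate answers per branch and fuses the scores with
    learned, input-dependent branch weights."""

    def __init__(self, hidden_dim: int = 768, num_branches: int = 3):
        super().__init__()
        # One scoring head per branch (e.g. dialog, scene description, knowledge).
        self.scorers = nn.ModuleList(
            [nn.Linear(hidden_dim, 1) for _ in range(num_branches)]
        )
        # Gating network: predicts one weight per branch from the
        # concatenated (pooled) branch embeddings.
        self.gate = nn.Linear(hidden_dim * num_branches, num_branches)

    def forward(self, branch_embs: List[torch.Tensor]) -> torch.Tensor:
        # branch_embs: one (batch, num_answers, hidden_dim) tensor per branch,
        # e.g. Transformer [CLS] embeddings of question + candidate answer.
        scores = torch.stack(
            [scorer(e).squeeze(-1) for scorer, e in zip(self.scorers, branch_embs)],
            dim=-1,
        )  # (batch, num_answers, num_branches)
        pooled = torch.cat([e.mean(dim=1) for e in branch_embs], dim=-1)
        weights = torch.softmax(self.gate(pooled), dim=-1)  # (batch, num_branches)
        # Weighted sum of per-branch scores -> one logit per candidate answer.
        return (scores * weights.unsqueeze(1)).sum(dim=-1)


if __name__ == "__main__":
    batch, num_answers, dim = 2, 4, 768
    branches = [torch.randn(batch, num_answers, dim) for _ in range(3)]
    logits = ModalityWeightingFusion(dim)(branches)
    print(logits.shape)  # torch.Size([2, 4])
```

In this sketch the gate softly up- or down-weights each information source per question, which matches the balancing role the abstract attributes to the modality weighting mechanism.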


Notes

  1. ‘Do androids dream of electric sheep?’ (Philip K. Dick, 1968).

  2. https://www.wikipedia.org/.

  3. https://www.imdb.com/.

  4. For example, https://bigbangtrans.wordpress.com/.

  5. Boy, girl, guy, lady, man, person, player, woman.

  6. For example, https://the-big-bang-theory.com/.

  7. Generating video plot summaries automatically from the whole video story is a challenging task in itself and out of the scope of this work. However, it is an interesting problem that we aim to study in future work.

  8. In The Big Bang Theory, the longest summary contains 1,605 words.


Acknowledgement

This work was supported by a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO), and JSPS KAKENHI Nos. 18H03264 and 20K19822. We would also like to thank the anonymous reviewers for their insightful comments, which helped improve the paper.

Author information


Corresponding author

Correspondence to Noa Garcia.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 530 KB)

Rights and permissions

Reprints and permissions

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Garcia, N., Nakashima, Y. (2020). Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol. 12363. Springer, Cham. https://doi.org/10.1007/978-3-030-58523-5_34


  • DOI: https://doi.org/10.1007/978-3-030-58523-5_34

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58522-8

  • Online ISBN: 978-3-030-58523-5

  • eBook Packages: Computer Science, Computer Science (R0)
