
Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions

  • Conference paper
  • First Online:
Computer Vision – ECCV 2020 (ECCV 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12363)

Included in the following conference series: European Conference on Computer Vision

Abstract

To understand movies, humans constantly reason over the dialogues and actions shown in specific scenes and relate them to the overall storyline already seen. Inspired by this behaviour, we design ROLL, a model for knowledge-based video story question answering that leverages three crucial aspects of movie understanding: dialog comprehension, scene reasoning, and storyline recalling. In ROLL, each of these tasks is in charge of extracting rich and diverse information by 1) processing scene dialogues, 2) generating unsupervised video scene descriptions, and 3) obtaining external knowledge in a weakly supervised fashion. To answer a given question correctly, the information generated by each of these cognitively inspired tasks is encoded via Transformers and fused through a modality weighting mechanism, which balances the information from the different sources. Exhaustive evaluation demonstrates the effectiveness of our approach, which yields a new state-of-the-art on two challenging video question answering datasets: KnowIT VQA and TVQA+.
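
The fusion step described in the abstract can be pictured with a short sketch. Below is a minimal, illustrative PyTorch implementation of a modality weighting mechanism that scores candidate answers from each branch (dialog, scene descriptions, external knowledge) and combines those scores with learned, input-dependent weights. The class name, dimensions, and gating design are assumptions made for illustration; this is not the authors' released code.

```python
# Minimal sketch only: an illustrative modality-weighting fusion in PyTorch.
# Class name, dimensions, and the gating design are assumptions, not the
# authors' released implementation.
from typing import List

import torch
import torch.nn as nn


class ModalityWeightingFusion(nn.Module):
    """Scores candidate answers per branch and fuses the scores with
    learned, input-dependent branch weights."""

    def __init__(self, hidden_dim: int = 768, num_branches: int = 3):
        super().__init__()
        # One scoring head per branch (e.g. dialog, scene description, knowledge).
        self.scorers = nn.ModuleList(
            [nn.Linear(hidden_dim, 1) for _ in range(num_branches)]
        )
        # Gating network: predicts one weight per branch from the
        # concatenated (pooled) branch embeddings.
        self.gate = nn.Linear(hidden_dim * num_branches, num_branches)

    def forward(self, branch_embs: List[torch.Tensor]) -> torch.Tensor:
        # branch_embs: one (batch, num_answers, hidden_dim) tensor per branch,
        # e.g. Transformer [CLS] embeddings of question + candidate answer.
        scores = torch.stack(
            [scorer(e).squeeze(-1) for scorer, e in zip(self.scorers, branch_embs)],
            dim=-1,
        )  # (batch, num_answers, num_branches)
        pooled = torch.cat([e.mean(dim=1) for e in branch_embs], dim=-1)
        weights = torch.softmax(self.gate(pooled), dim=-1)  # (batch, num_branches)
        # Weighted sum of per-branch scores -> one logit per candidate answer.
        return (scores * weights.unsqueeze(1)).sum(dim=-1)


if __name__ == "__main__":
    batch, num_answers, dim = 2, 4, 768
    branches = [torch.randn(batch, num_answers, dim) for _ in range(3)]
    logits = ModalityWeightingFusion(dim)(branches)
    print(logits.shape)  # torch.Size([2, 4])
```

In this sketch the gate softly up- or down-weights each information source per question, which matches the balancing role the abstract attributes to the modality weighting mechanism.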


Notes

  1. ‘Do androids dream of electric sheep?’ (Philip K. Dick, 1968).

  2. https://www.wikipedia.org/.

  3. https://www.imdb.com/.

  4. For example, https://bigbangtrans.wordpress.com/.

  5. Boy, girl, guy, lady, man, person, player, woman.

  6. For example, https://the-big-bang-theory.com/.

  7. Generating video plot summaries automatically from the whole video story is a challenging task in itself and out of the scope of this work. However, it is an interesting problem that we aim to study in future work.

  8. In The Big Bang Theory, the longest summary contains 1,605 words.


Acknowledgement

This work was supported by a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO), and JSPS KAKENHI Nos. 18H03264 and 20K19822. We would also like to thank the anonymous reviewers for their insightful comments, which helped improve the paper.

Author information


Corresponding author

Correspondence to Noa Garcia.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 530 KB)

Rights and permissions

Reprints and permissions

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Garcia, N., Nakashima, Y. (2020). Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol. 12363. Springer, Cham. https://doi.org/10.1007/978-3-030-58523-5_34


  • DOI: https://doi.org/10.1007/978-3-030-58523-5_34

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58522-8

  • Online ISBN: 978-3-030-58523-5

  • eBook Packages: Computer Science, Computer Science (R0)
