Leveraging semantic saliency maps for query-specific video summarization

Multimedia Tools and Applications

Abstract

The immense number of videos uploaded to video sharing platforms makes it impossible for a person to watch all of them and understand what happens in them. Hence, machine learning techniques are now deployed to index videos by recognizing key objects, actions, and scenes or places. Summarization is an alternative, as it extracts only the important parts while preserving the gist of the video content. Ideally, the user may prefer to analyze a certain action or scene by searching for a query term within the video. Current summarization methods generally do not take queries into account or require exhaustive data labeling. In this work, we present a weakly supervised query-focused video summarization method. Our proposed approach uses semantic attributes as an indicator of query relevance and semantic attention maps to locate related regions in the frames, and it combines both within a submodular maximization framework. We conducted experiments on the recently introduced RAD dataset and obtained highly competitive results. Moreover, to better evaluate the performance of our approach on longer videos, we collected a new dataset, which consists of 10 YouTube videos annotated with multiple shot-level attributes. Our dataset enables a much more diverse set of queries, which can be used to summarize a video from different perspectives with more degrees of freedom.
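To make the idea of selecting shots within a submodular maximization framework more concrete, the following is a minimal sketch of greedy selection under a shot budget. It is not the objective used in the article, which the abstract does not specify. It assumes a simple stand-in objective: a per-shot query relevance score plus a facility-location coverage term computed from shot features. The function name `greedy_summary`, its parameters, and the toy inputs are all hypothetical.

```python
import numpy as np


def greedy_summary(relevance, features, budget, diversity_weight=1.0):
    """Greedily pick up to `budget` shots.

    Assumed objective (illustrative only, not the paper's):
        F(S) = sum of relevance over S
               + diversity_weight * sum_j max_{i in S} sim(i, j)
    Both terms are monotone submodular, so greedy selection is a
    standard approximation strategy under a cardinality constraint.
    """
    relevance = np.asarray(relevance, dtype=float)
    features = np.asarray(features, dtype=float)
    n = len(relevance)

    # Cosine similarity between shots, clipped at zero so that the
    # coverage term never decreases when a shot is added.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    sim = np.clip(unit @ unit.T, 0.0, None)

    selected = []
    chosen = np.zeros(n, dtype=bool)
    covered = np.zeros(n)  # best similarity of each shot to the summary so far

    for _ in range(min(budget, n)):
        # Marginal coverage gain of each candidate i: how much the total
        # coverage would grow if shot i joined the summary.
        coverage_gain = np.maximum(sim, covered[None, :]).sum(axis=1) - covered.sum()
        gain = relevance + diversity_weight * coverage_gain
        gain[chosen] = -np.inf  # never pick the same shot twice
        best = int(np.argmax(gain))
        selected.append(best)
        chosen[best] = True
        covered = np.maximum(covered, sim[best])
    return selected


# Toy example: shots 0 and 1 are near-duplicates, shot 2 is different.
scores = [0.9, 0.85, 0.1]          # hypothetical query relevance per shot
feats = [[1, 0], [1, 0], [0, 1]]   # hypothetical shot features
print(greedy_summary(scores, feats, budget=2))
```

In this toy example the second pick is the dissimilar shot rather than the near-duplicate with the higher relevance score, which is the behavior a coverage term is meant to encourage. Setting `diversity_weight=0.0` reduces the selection to ranking by relevance alone.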

Notes

  1. https://hucvl.github.io/query-specific-summarization/.

  2. For additional qualitative results, please refer to the project website at https://hucvl.github.io/query-specific-summarization.

Acknowledgments

This work was supported in part by the GEBIP 2018 Award of the Turkish Academy of Sciences to E. Erdem and the BAGEP 2021 Award of the Science Academy to A. Erdem.

Author information

Corresponding author

Correspondence to Erkut Erdem.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Cizmeciler, K., Erdem, E. & Erdem, A. Leveraging semantic saliency maps for query-specific video summarization. Multimed Tools Appl 81, 17457–17482 (2022). https://doi.org/10.1007/s11042-022-12442-w

  • DOI: https://doi.org/10.1007/s11042-022-12442-w
