Abstract
The sheer volume of videos uploaded to video sharing platforms makes it impossible for a person to watch them all and understand what happens in them. Hence, machine learning techniques are now deployed to index videos by recognizing key objects, actions, and scenes or places. Summarization is an alternative, as it extracts only the important parts while covering the gist of the video content. Ideally, the user may prefer to analyze a certain action or scene by searching for a query term within the video. Current summarization methods generally either do not take queries into account or require exhaustive data labeling. In this work, we present a weakly supervised query-focused video summarization method. Our proposed approach uses semantic attributes as an indicator of query relevance and semantic attention maps to locate related regions in the frames, and utilizes both within a submodular maximization framework. We conducted experiments on the recently introduced RAD dataset and obtained highly competitive results. Moreover, to better evaluate the performance of our approach on longer videos, we collected a new dataset, which consists of 10 videos from YouTube annotated with multiple shot-level attributes. Our dataset enables a much more diverse set of queries that can be used to summarize a video from different perspectives with more degrees of freedom.
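To make the phrase "submodular maximization framework" concrete, the following is a minimal sketch of greedy shot selection under a cardinality budget. It is not the authors' objective: the facility-location function, the way query relevance scales each shot's coverage, and the names `facility_location`, `greedy_summary`, `similarity`, `relevance`, and `budget` are all assumptions made for illustration.

```python
from typing import List, Sequence


def facility_location(weights: Sequence[Sequence[float]],
                      selected: Sequence[int]) -> float:
    """F(S) = sum_i max_{j in S} weights[i][j].

    For non-negative weights this function is monotone and submodular.
    """
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in weights)


def greedy_summary(similarity: Sequence[Sequence[float]],
                   relevance: Sequence[float],
                   budget: int) -> List[int]:
    """Greedily pick up to `budget` shot indices that maximize F.

    Assumption (hypothetical scoring, not from the paper): shot j covers
    shot i in proportion to their similarity, scaled by how relevant
    shot j is to the query.
    """
    n = len(relevance)
    weights = [[similarity[i][j] * relevance[j] for j in range(n)]
               for i in range(n)]
    selected: List[int] = []
    current = 0.0
    for _ in range(min(budget, n)):
        best_gain, best_j = 0.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding shot j to the current summary.
            gain = facility_location(weights, selected + [j]) - current
            if gain > best_gain:
                best_gain, best_j = gain, j
        if best_j is None:  # no remaining shot improves the objective
            break
        selected.append(best_j)
        current += best_gain
    return selected
```

With four shots in which shots 0 and 1 are near-duplicates and shots 2 and 3 are near-duplicates, the greedy loop tends to pick one relevant shot from each group rather than two redundant ones, which is the diminishing-returns behavior that makes submodular objectives a natural fit for summarization.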
Notes
For additional qualitative results, please refer to the project website at https://hucvl.github.io/query-specific-summarization.
Acknowledgments
This work was supported in part by the GEBIP 2018 Award of the Turkish Academy of Sciences to E. Erdem and the BAGEP 2021 Award of the Science Academy to A. Erdem.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Cizmeciler, K., Erdem, E. & Erdem, A. Leveraging semantic saliency maps for query-specific video summarization. Multimed Tools Appl 81, 17457–17482 (2022). https://doi.org/10.1007/s11042-022-12442-w
DOI: https://doi.org/10.1007/s11042-022-12442-w