Abstract
There has been emerging interest recently in three-dimensional (3D) convolutional neural networks (CNNs) as a powerful tool for encoding spatio-temporal representations in videos, obtained by adding a third, temporal dimension to pre-existing 2D CNNs. In this chapter, we discuss the effectiveness of 3D convolutions for capturing the important motion features in the context of video saliency prediction. The method filters spatio-temporal features across multiple adjacent frames. This cubic convolution can be applied efficiently to a dense sequence of frames, propagating information from previous frames into the current one, mirroring processing mechanisms of the human visual system for better saliency prediction. We extensively evaluate the model against state-of-the-art video saliency models on both 2D and 360° videos. The architecture can efficiently learn expressive spatio-temporal representations and produce high-quality video saliency maps on three large-scale 2D datasets: DHF1K, UCF-SPORTS, and DAVIS. Investigations on Salient360! and other 360° datasets show how the approach can generalise.
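To make the idea of "cubic" convolution concrete, the following is a minimal sketch (not the authors' implementation, and in plain Python rather than a deep learning framework) of a single 3D convolution that mixes information across adjacent frames of a grayscale video stack. The shapes, the "valid" padding choice, and the uniform averaging kernel are illustrative assumptions.

```python
def conv3d_valid(frames, kernel):
    """'Valid' 3D convolution over a (T, H, W) stack of grayscale frames.

    frames: nested lists indexed [t][i][j]; kernel: nested lists [dt][di][dj].
    Each output cell is a weighted sum over a kt x kh x kw spatio-temporal
    window, so temporal context from neighbouring frames enters every output.
    """
    T, H, W = len(frames), len(frames[0]), len(frames[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):          # slide over time
        plane = []
        for i in range(H - kh + 1):      # slide over rows
            row = []
            for j in range(W - kw + 1):  # slide over columns
                s = 0.0
                for dt in range(kt):
                    for di in range(kh):
                        for dj in range(kw):
                            s += frames[t + dt][i + di][j + dj] * kernel[dt][di][dj]
                row.append(s)
            plane.append(row)
        out.append(plane)
    return out


# Toy example: three 3x3 frames with constant values 0, 1, 2, averaged by a
# uniform 3x3x3 kernel. The single output value blends all three frames.
frames = [[[float(t)] * 3 for _ in range(3)] for t in range(3)]
kernel = [[[1 / 27.0] * 3 for _ in range(3)] for _ in range(3)]
result = conv3d_valid(frames, kernel)
print(result[0][0][0])  # → 1.0, the mean over the whole 3x3x3 window
```

In a real model this operation runs per channel with learned kernels (e.g. via a framework's 3D convolution layer), but the nested loops above show the core mechanism the abstract describes: each output location pools evidence from a small spatio-temporal neighbourhood rather than from a single frame.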
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Dahou Djilali, Y.A., Sayah, M., McGuinness, K., O’Connor, N.E. (2022). On the Use of 3D CNNs for Video Saliency Modeling. In: Bouatouch, K., et al. Computer Vision, Imaging and Computer Graphics Theory and Applications. VISIGRAPP 2020. Communications in Computer and Information Science, vol 1474. Springer, Cham. https://doi.org/10.1007/978-3-030-94893-1_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-94892-4
Online ISBN: 978-3-030-94893-1