Abstract
In this paper, we propose a two-stage temporal proposal algorithm for action detection in long untrimmed videos. In the first stage, we propose a novel prior-minor watershed algorithm that combines a precise prior watershed proposal algorithm with a minor supplementary sliding window algorithm. A correctness discriminator is introduced to fill in proposals that the watershed algorithm may omit, using the sliding-window proposals. In the second stage, we first propose an extended context pooling (ECP) with two modules (internal and context). The context module of ECP structures the proposals and enhances the extended features of the action proposals, and different levels of ECP are introduced to model the action proposal region and make its extended context region more targeted and precise. We then propose a temporal context regression network, which adopts a multi-task loss to train temporal coordinate regression and action/background classification simultaneously, and which outputs precise temporal boundaries for the proposals. We also propose prior-minor ranking to balance the contributions of the prior watershed proposals and the minor supplementary proposals. On three large-scale benchmarks, THUMOS14, ActivityNet (v1.2 and v1.3), and Charades, our approach achieves superior performance compared with other state-of-the-art methods and runs at over 1020 frames per second (fps) on a single NVIDIA Titan X Pascal GPU, indicating that our method can efficiently improve the precision of the temporal action localization task.
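The two proposal sources in the first stage can be illustrated with a minimal sketch. This is not the authors' implementation; the thresholds, window sizes, minimum length, and function names below are all illustrative assumptions. The watershed-style generator "floods" a 1-D per-frame actionness curve at several levels and keeps each contiguous above-water run as a candidate segment, while dense multi-scale sliding windows supply the minor supplementary proposals.

```python
def watershed_proposals(actionness, thresholds=(0.3, 0.5, 0.7), min_len=2):
    """Flood a 1-D actionness curve at several levels; each contiguous run
    of frames above a level becomes a candidate proposal (start, end).
    All parameters here are illustrative, not the paper's settings."""
    proposals = set()
    for tau in thresholds:
        above = [score >= tau for score in actionness]
        start = None
        for t, flag in enumerate(above):
            if flag and start is None:
                start = t                      # run begins
            elif not flag and start is not None:
                if t - start >= min_len:       # keep runs long enough
                    proposals.add((start, t))
                start = None
        if start is not None and len(above) - start >= min_len:
            proposals.add((start, len(above)))  # run reaches the last frame
    return sorted(proposals)


def sliding_window_proposals(n_frames, sizes=(4, 8), stride=2):
    """Dense multi-scale windows: the supplementary proposal source that a
    correctness discriminator would draw from to cover omitted actions."""
    return [(s, s + w) for w in sizes
            for s in range(0, n_frames - w + 1, stride)]
```

For example, an actionness curve with two high-scoring runs yields one proposal per run, while the sliding windows cover the timeline exhaustively regardless of scores; a discriminator can then keep windows only where the watershed output left gaps.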
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant 61973065, and in part by the Fundamental Research Funds for the Central Universities of China under Grants N172608005, N182612002, and N2026002.
Cite this article
Wang, F., Wang, G., Du, Y. et al. A two-stage temporal proposal network for precise action localization in untrimmed video. Int. J. Mach. Learn. & Cyber. 12, 2199–2211 (2021). https://doi.org/10.1007/s13042-021-01301-z