Abstract
Temporal action detection is an important yet challenging task, in which temporal action proposal generation plays an important part. Since the temporal boundaries of action instances in videos are often ambiguous, it is difficult to locate them precisely. Boundary Sensitive Network (BSN) (Lin et al. in ECCV, 2018) is a state-of-the-art corner-based method that can generate high-quality proposals with a high recall rate. It contains a temporal evaluation network and a proposal evaluation network that generate and evaluate proposals separately: the former locates the temporal boundaries of action instances directly to produce proposals with flexible temporal intervals, and the latter evaluates the quality of those proposals. However, BSN still has two issues: (1) due to the small receptive field of its temporal evaluation network, it often generates many false temporal boundaries; (2) evaluating the quality of proposals is a difficult task that is not well solved in the original paper. To address these issues, we propose the Complementary Boundary Estimation Network (CBEN), an improved approach to temporal action proposal generation based on the framework of BSN. Specifically, we improve BSN in two respects. First, since the temporal evaluation network of BSN can only capture local information and tends to respond strongly at background segments, we combine it with a new network with a larger receptive field to better reject false temporal action boundaries. Second, to evaluate the quality of temporal action proposals more accurately, we propose a class-based proposal evaluation network and combine it with a tIoU-based proposal evaluation network to filter out low-quality proposals. Extensive experiments on the THUMOS14 and ActivityNet-1.3 datasets indicate that CBEN achieves better performance than current mainstream methods on temporal action proposal generation. We further combine CBEN with an off-the-shelf action classifier and show consistent performance improvements on the THUMOS14 dataset.
References
Bodla N, Singh B, Chellappa R, Davis LS (2017) Soft-NMS: improving object detection with one line of code. In: ICCV, pp 5561–5569
Buch S, Escorcia V, Shen C, Ghanem B, Niebles JC (2017) Sst: single-stream temporal action proposals. In: CVPR, pp 2911–2920
Caba Heilbron F, Escorcia V, Ghanem B, Carlos Niebles J (2015) Activitynet: a large-scale video benchmark for human activity understanding. In: CVPR, pp 961–970
Chao YW, Vijayanarasimhan S, Seybold B, Ross DA, Deng J, Sukthankar R (2018) Rethinking the faster r-cnn architecture for temporal action localization. In: CVPR, pp 1130–1139
Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: CVPR, pp 248–255
Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: CVPR, pp 1933–1941
Gao J, Chen K, Nevatia R (2018) Ctap: complementary temporal action proposal generation. In: ECCV, pp 68–83
Gao J, Yang Z, Chen K, Sun C, Nevatia R (2017) Turn tap: temporal unit regression network for temporal action proposals. In: ICCV, pp 3628–3636
Gao J, Yang Z, Nevatia R (2017) Cascaded boundary regression for temporal action detection. arXiv preprint arXiv:1705.01180
He K, Gkioxari G, Dollár P, Girshick R (2017) Mask r-cnn. In: ICCV, pp 2961–2969
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: CVPR, pp 770–778
Jiang YG, Liu J, Zamir AR, Toderici G, Laptev I, Shah M, Sukthankar R (2014) Thumos challenge: action recognition with a large number of classes
Law H, Deng J (2018) Cornernet: detecting objects as paired keypoints. In: ECCV, pp 734–750
Li X, Lin T, Liu X, Gan C, Zuo W, Li C, Long X, He D, Li F, Wen S (2019) Deep concept-wise temporal convolutional networks for action localization. In: ICCV
Lin T, Zhao X, Shou Z (2017) Single shot temporal action detection. In: ACM international conference on multimedia, pp 988–996
Lin T, Zhao X, Su H, Wang C, Yang M (2018) Bsn: boundary sensitive network for temporal action proposal generation. In: ECCV, pp 3–19
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) Ssd: single shot multibox detector. In: ECCV
Lu X, Li B, Yue Y, Li Q, Yan J (2019) Grid r-cnn. In: CVPR, pp 7363–7372
Luo W, Li Y, Urtasun R, Zemel RS (2016) Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp 4898–4906
Newell A, Yang K, Deng J (2016) Stacked hourglass networks for human pose estimation. In: ECCV, pp 483–499
Ren S, He K, Girshick R, Sun J (2017) Faster r-cnn: towards real-time object detection with region proposal networks. PAMI 39(6):1137–1149
Shou Z, Chan J, Zareian A, Miyazawa K, Chang SF (2017) Cdc: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In: CVPR, pp 5734–5743
Shou Z, Wang D, Chang SF (2016) Temporal action localization in untrimmed videos via multi-stage cnns. In: CVPR, pp 1049–1058
Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3d convolutional networks. In: ICCV, pp 4489–4497
Xiong Y, Wang L, Wang Z, Zhang B, Song H, Li W, Lin D, Qiao Y, Van Gool L, Tang X (2016) Cuhk & ethz & siat submission to activitynet challenge 2016. arXiv preprint arXiv:1608.00797
Xiong Y, Zhao Y, Wang L, Lin D, Tang X (2017) A pursuit of temporal accuracy in general activity detection. arXiv preprint arXiv:1703.02716
Xu H, Das A, Saenko K (2017) R-c3d: region convolutional 3d network for temporal activity detection. In: ICCV, pp 5783–5792
Zhao Y, Xiong Y, Wang L, Wu Z, Tang X, Lin D (2017) Temporal action detection with structured segment networks. In: ICCV, pp 2914–2923
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grants 61673402, 61273270, and 60802069; in part by the Natural Science Foundation of Guangdong under Grants 2017A030311029; in part by the Science and Technology Program of Guangzhou under Grants 201704020180; and in part by the Fundamental Research Funds for the Central Universities of China.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Calculation of tIoU
tIoU is short for temporal Intersection over Union. Figure 13 shows the beginning and ending times of a temporal action proposal and a ground truth instance. Denoting the proposal as \([t_s^p, t_e^p]\) and the ground truth as \([t_s^g, t_e^g]\), we can calculate the tIoU between the proposal and ground truth by
\[
\mathrm{tIoU} = \frac{\max \bigl (0,\ \min (t_e^p, t_e^g) - \max (t_s^p, t_s^g)\bigr )}{(t_e^p - t_s^p) + (t_e^g - t_s^g) - \max \bigl (0,\ \min (t_e^p, t_e^g) - \max (t_s^p, t_s^g)\bigr )}
\]
A higher tIoU indicates that the proposal is closer to the ground truth, i.e. that the proposal is of higher quality.
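The computation above can be sketched in Python (a minimal illustration using plain `(start, end)` tuples; the function name `tiou` is ours):

```python
def tiou(proposal, ground_truth):
    """Temporal Intersection over Union between two (start, end) intervals."""
    (ps, pe), (gs, ge) = proposal, ground_truth
    # overlap between the two intervals, clamped at zero for disjoint intervals
    intersection = max(0.0, min(pe, ge) - max(ps, gs))
    # union = sum of lengths minus the overlap counted twice
    union = (pe - ps) + (ge - gs) - intersection
    return intersection / union if union > 0 else 0.0

# e.g. proposal [2, 8] against ground truth [4, 10]: overlap 4, union 8
print(tiou((2.0, 8.0), (4.0, 10.0)))  # 0.5
```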
Appendix B: Derivative of Mean Square Error
Assume we use mean square error to optimize the tIoU-based PEN. Let \(\mathbf {x}\) be the input to the output layer, \(a=\sigma (\mathbf {w}^\top \mathbf {x})\) the sigmoid output, and \(y\) the target tIoU, so that \(J(\mathbf {w})=\frac{1}{2}(y-a)^2\). Then the first-order derivative of the weights in the output layer can be calculated by
\[
\frac{\partial {J}}{\partial {\mathbf {w}}} = (a-y)\,a\,(1-a)\,\mathbf {x} = -g(a)\,\mathbf {x}, \qquad g(x)=(x-y)(x-1)x
\]
Note that \(g(x)=(x-y)(x-1)x\) is not a monotonically increasing function for \(x,y\in (0,1)\), so \(\frac{\partial {J}}{\partial {\mathbf {w}}}\) is not monotonically increasing in the output either; since a differentiable function is convex only if its derivative is nondecreasing, \(J(\mathbf {w})\) is a non-convex function.
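This gradient, and the non-monotonicity of \(g\), can be checked numerically with a small sketch (a single sigmoid output unit with scalar weight and input — an illustrative setup, not the paper's actual network):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y):
    # J(w) = 1/2 (a - y)^2 with a = sigmoid(w * x)
    a = sigmoid(w * x)
    return 0.5 * (a - y) ** 2

def grad(w, x, y):
    # analytic gradient: (a - y) * a * (1 - a) * x
    a = sigmoid(w * x)
    return (a - y) * a * (1.0 - a) * x

# central finite-difference check of the analytic gradient
w, x, y, eps = 0.7, 1.3, 0.9, 1e-6
numeric = (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)
assert abs(numeric - grad(w, x, y)) < 1e-8

# g(x) = (x - y)(x - 1)x rises and then falls on (0, 1), so it is not monotone
g = lambda a: (a - y) * (a - 1) * a
values = [g(k / 10) for k in range(1, 10)]
assert not all(v1 <= v2 for v1, v2 in zip(values, values[1:]))
```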
Appendix C: Derivative of Softmax Cross Entropy
Assume softmax cross entropy is used to optimize the class-based PEN. Let \(\mathbf {x}\) be the input to the output layer, \(z_n=\mathbf {w}_n^\top \mathbf {x}\), \(a_n=\frac{e^{z_n}}{\sum _k e^{z_k}}\), and \(J=-\sum _n y_n\log a_n\) with one-hot labels \(y_n\). Then we can calculate the first-order and second-order derivatives of the weights in the output layer by
\[
\frac{\partial {J}}{\partial {\mathbf {w}_n}} = (a_n-y_n)\,\mathbf {x}, \qquad \frac{\partial {^2J}}{\partial {\mathbf {w}_n^2}} = a_n(1-a_n)\,\mathbf {x}\mathbf {x}^\top
\]
Considering \(a_n\in (0,1)\), \(\frac{\partial {^2J}}{\partial {\mathbf {w}_n^2}}\) is a positive semi-definite matrix, which means \(J(\mathbf {w})\) is a convex function.
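A small numerical sketch (toy weight matrix and one-hot label, ours for illustration) that checks the first-order derivative against finite differences and the positive semi-definiteness of the per-class second-derivative block:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

# toy setup: 4 features, 3 classes; column n of W is the class weight vector w_n
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W = rng.normal(size=(4, 3))
y = np.array([0.0, 1.0, 0.0])  # one-hot label

def loss(W):
    a = softmax(W.T @ x)
    return -np.sum(y * np.log(a))

# analytic first-order derivative: column n is (a_n - y_n) x
a = softmax(W.T @ x)
analytic = np.outer(x, a - y)

# central finite-difference check of the gradient
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)
assert np.allclose(analytic, numeric, atol=1e-6)

# second-order block for class n: a_n (1 - a_n) x x^T is positive semi-definite
H = a[1] * (1 - a[1]) * np.outer(x, x)
assert np.all(np.linalg.eigvalsh(H) >= -1e-12)
```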
Cite this article
Wang, J., Hu, H. Complementary Boundary Estimation Network for Temporal Action Proposal Generation. Neural Process Lett 52, 2275–2295 (2020). https://doi.org/10.1007/s11063-020-10349-x