Enhanced 3D Human Pose Estimation from Videos by Using Attention-Based Neural Network with Dilated Convolutions

Liu, Ruixu; Shen, Ju; Wang, He; Chen, Chen; Cheung, Sen-ching; Asari, Vijayan K.

doi:10.1007/s11263-021-01436-0

Published: 26 February 2021

International Journal of Computer Vision, Volume 129, pages 1596–1615 (2021)

Ruixu Liu¹ (ORCID: orcid.org/0000-0003-0458-3576), Ju Shen¹, He Wang¹, Chen Chen², Sen-ching Cheung³ & Vijayan K. Asari¹

Abstract

The attention mechanism provides a sequential prediction framework for learning spatial models with enhanced implicit temporal consistency. In this work, we present a systematic design (from 2D to 3D) showing how conventional networks and other forms of constraints can be incorporated into the attention framework to learn long-range dependencies for pose estimation. The contribution of this paper is a systematic approach to designing and training attention-based models for end-to-end pose estimation, with the flexibility and scalability to take arbitrary video sequences as input. We achieve this by adapting the temporal receptive field via a multi-scale structure of dilated convolutions. In addition, the proposed architecture can be easily adapted to a causal model, enabling real-time performance. Any off-the-shelf 2D pose estimation systems, e.g. […] Our method achieves state-of-the-art performance and outperforms existing methods by reducing the mean per joint position error to 33.4 mm on the Human3.6M dataset. Our code is available at https://github.com/lrxjason/Attention3DHumanPose
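The abstract mentions three quantities that are easy to make concrete: the temporal receptive field of a stack of dilated convolutions, the left-only padding a causal variant would need, and the mean per joint position error (MPJPE). The following is a minimal sketch under stated assumptions, not the authors' implementation: the kernel size of 3 and the dilation schedule 1, 3, 9, 27 are hypothetical illustration values, the function names are invented here, and MPJPE is computed as the plain mean Euclidean distance between predicted and ground-truth joints without any alignment step.

```python
import math
from typing import Sequence, Tuple

Joint = Tuple[float, float, float]


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Input frames seen by one output frame of stacked 1D dilated
    convolutions with stride 1. Each layer widens the field by
    (kernel_size - 1) * dilation frames."""
    field = 1
    for dilation in dilations:
        field += (kernel_size - 1) * dilation
    return field


def causal_left_padding(kernel_size: int, dilations: Sequence[int]) -> int:
    """Total left-only padding so that the output at frame t depends
    only on frames <= t (no look-ahead into future frames)."""
    return receptive_field(kernel_size, dilations) - 1


def mpjpe(pred: Sequence[Joint], target: Sequence[Joint]) -> float:
    """Mean per joint position error: average Euclidean distance
    between corresponding predicted and ground-truth 3D joints."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


# Hypothetical schedule: kernel size 3 with dilations growing by a factor of 3.
dilations = [1, 3, 9, 27]
print(receptive_field(3, dilations))      # 81 frames
print(causal_left_padding(3, dilations))  # 80 frames of left padding

# Toy pose with two joints; per-joint errors are 3 and 4, so the mean is 3.5.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
print(mpjpe(pred, target))                # 3.5
```

Growing the dilation geometrically lets the receptive field expand exponentially with depth while the parameter count grows only linearly, which is one common way to cover long video sequences. In the reported result the error unit is millimetres, so the inputs to a function like `mpjpe` would be joint coordinates in millimetres.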

Notes

Demo: https://sites.google.com/a/udayton.edu/jshen1/pose3d.

Acknowledgements

This work is partially supported by the National Endowment for the Humanities under Grant No. AKA-260488-18 and by the National Science Foundation (NSF) under Grant No. 1910844.

Author information

Authors and Affiliations

University of Dayton, Ohio, USA
Ruixu Liu, Ju Shen, He Wang & Vijayan K. Asari
University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
Chen Chen
University of Kentucky, Lexington, KY, 40506, USA
Sen-ching Cheung

Corresponding author

Correspondence to Ruixu Liu.

Additional information

Communicated by Mei Chen.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Liu, R., Shen, J., Wang, H. et al. Enhanced 3D Human Pose Estimation from Videos by Using Attention-Based Neural Network with Dilated Convolutions. Int J Comput Vis 129, 1596–1615 (2021). https://doi.org/10.1007/s11263-021-01436-0

Received: 21 December 2019
Accepted: 12 January 2021
Published: 26 February 2021
Issue Date: May 2021
DOI: https://doi.org/10.1007/s11263-021-01436-0
