
Enhanced 3D Human Pose Estimation from Videos by Using Attention-Based Neural Network with Dilated Convolutions

International Journal of Computer Vision

Abstract

The attention mechanism provides a sequential prediction framework for learning spatial models with enhanced implicit temporal consistency. In this work, we present a systematic design (from 2D to 3D) showing how conventional networks and other forms of constraints can be incorporated into the attention framework to learn long-range dependencies for pose estimation. The contribution of this paper is a systematic approach to designing and training attention-based models for end-to-end pose estimation, with the flexibility and scalability to accept arbitrary video sequences as input. We achieve this by adapting the temporal receptive field via a multi-scale structure of dilated convolutions. In addition, the proposed architecture can be easily adapted to a causal model, enabling real-time performance. Any off-the-shelf 2D pose estimation systems, e.g. [...] Our method achieves state-of-the-art performance and outperforms existing methods by reducing the mean per joint position error to 33.4 mm on the Human3.6M dataset. Our code is available at https://github.com/lrxjason/Attention3DHumanPose
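As a rough illustration of three ideas named in the abstract (the temporal receptive field of stacked dilated convolutions, the causal variant, and the mean per joint position error), the following Python sketch may help. It is not the authors' implementation. The function names, the kernel size of 3, and the dilation schedule 1, 3, 9, 27, 81 are assumptions chosen for illustration only, and the toy convolution operates on a plain list of numbers rather than on pose features.

```python
import math


def receptive_field(kernel_size, dilations):
    """Number of input frames seen by a stack of stride-1 dilated 1D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def dilated_conv1d(signal, weights, dilation, causal=False):
    """Toy dilated 1D convolution (cross-correlation) with zero padding.

    The output has the same length as the input. With causal=True all
    padding goes on the left, so the output at frame t depends only on
    frames <= t; otherwise the padding is split symmetrically.
    """
    span = (len(weights) - 1) * dilation
    left = span if causal else span // 2
    right = span - left
    padded = [0.0] * left + list(signal) + [0.0] * right
    return [
        sum(w * padded[t + i * dilation] for i, w in enumerate(weights))
        for t in range(len(signal))
    ]


def mpjpe(pred, target):
    """Mean per joint position error: average Euclidean distance over joints."""
    return sum(math.dist(p, q) for p, q in zip(pred, target)) / len(pred)


if __name__ == "__main__":
    # Assumed schedule: kernel size 3, dilation growing by a factor of 3 per layer.
    print(receptive_field(3, [1, 3, 9, 27, 81]))

    frames = [1, 2, 3, 4, 5]
    print(dilated_conv1d(frames, [1, 1, 1], dilation=1, causal=True))
    print(dilated_conv1d(frames, [1, 1, 1], dilation=1, causal=False))

    # Two joints, one predicted exactly and one off by a 3-4-5 triangle.
    print(mpjpe([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 3, 4)]))
```

The receptive field grows with the sum of the dilations, which is why a few layers can cover hundreds of frames. In the causal variant no future frame contributes to the current output, which is the property a real-time model needs.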



Notes

  1. Demo: https://sites.google.com/a/udayton.edu/jshen1/pose3d.


Acknowledgements

This work is partially supported by the National Endowment for the Humanities under Grant No. AKA-260488-18 and the National Science Foundation (NSF) under Grant No. 1910844.

Author information


Corresponding author

Correspondence to Ruixu Liu.

Additional information

Communicated by Mei Chen.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Liu, R., Shen, J., Wang, H. et al. Enhanced 3D Human Pose Estimation from Videos by Using Attention-Based Neural Network with Dilated Convolutions. Int J Comput Vis 129, 1596–1615 (2021). https://doi.org/10.1007/s11263-021-01436-0


  • DOI: https://doi.org/10.1007/s11263-021-01436-0
