Abstract

We describe two new approaches to human pose estimation. Both can quickly and accurately predict the 3D positions of body joints from a single depth image, without using any temporal information. The key to both approaches is the use of a large, realistic, and highly varied synthetic set of training images. This allows us to learn models that are largely invariant to factors such as pose, body shape, and field-of-view cropping. Our first approach employs an intermediate body parts representation, designed so that an accurate per-pixel classification of the parts will localize the joints of the body. The second approach instead directly regresses the positions of body joints. By using simple depth pixel comparison features, and parallelizable decision forests, both approaches can run super-realtime on consumer hardware. Our evaluation investigates many aspects of our methods, and compares the approaches to each other and to the state of the art. Parts of this chapter are reprinted, with permission, from Shotton et al., Proc IEEE Conf. Computer Vision and Pattern Recognition (CVPR) (2011), © 2011 IEEE.
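The abstract does not define the "simple depth pixel comparison features". As a minimal sketch, assuming the depth-normalized two-probe form used in the cited CVPR 2011 work, a feature can be computed as the difference between the depths at two offset pixels, with the offsets divided by the depth at the reference pixel. The function names, the background constant, and the (row, col) offset convention below are illustrative assumptions, not the chapter's implementation.

```python
import numpy as np

# Assumed large depth value returned for probes that fall outside the image.
BACKGROUND_DEPTH = 1e6


def probe(depth, row, col):
    """Return the depth at (row, col), or a large constant if off-image."""
    height, width = depth.shape
    if 0 <= row < height and 0 <= col < width:
        return float(depth[row, col])
    return BACKGROUND_DEPTH


def depth_comparison_feature(depth, pixel, offset_u, offset_v):
    """Difference of two depth probes around `pixel`.

    Offsets are (row, col) pairs divided by the depth at `pixel`, so the
    probe pattern shrinks in image space as the subject moves away.
    """
    row, col = pixel
    d = float(depth[row, col])
    row_u = int(round(row + offset_u[0] / d))
    col_u = int(round(col + offset_u[1] / d))
    row_v = int(round(row + offset_v[0] / d))
    col_v = int(round(col + offset_v[1] / d))
    return probe(depth, row_u, col_u) - probe(depth, row_v, col_v)


# Hypothetical 5x5 depth map (metres) with one farther pixel.
example_depth = np.full((5, 5), 2.0)
example_depth[2, 4] = 3.0
example_value = depth_comparison_feature(example_depth, (2, 2), (0, 4), (0, 0))
```

Each split node of a decision tree would then compare such a feature value against a learned threshold. Because every pixel is evaluated independently, the computation parallelizes naturally, which is consistent with the running-time claim in the abstract.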

This work was undertaken at Microsoft Research, Cambridge, in collaboration with Xbox. See http://research.microsoft.com/vision/. Ross Girshick is currently a postdoctoral fellow at UC Berkeley.

Notes

  1. We use K to indicate the maximum number of relative votes allowed. In practice we allow some leaf nodes to store fewer than K votes for some joints.

  2. Recall that for notational simplicity we are assuming u defines a pixel 2D position in a particular image; the ground truth joint positions P will therefore correspond for each particular image.

  3. This threshold could equivalently be applied at test time, though doing so would waste memory in the tree.

  4. The results for ojr at 300k images were so compelling we chose not to expend the considerable energy in training a directly comparable 900k forest.

References

  1. Belongie S, Malik J, Puzicha J (2002) Shape matching and object recognition using shape contexts. IEEE Trans Pattern Anal Mach Intell 24

  2. Bourdev L, Malik J (2009) Poselets: body part detectors trained using 3D human pose annotations. In: Proc IEEE intl conf on computer vision (ICCV)

  3. Bregler C, Malik J (1998) Tracking people with twists and exponential maps. In: Proc IEEE conf computer vision and pattern recognition (CVPR)

  4. Brubaker MA, Fleet DJ, Hertzmann A (2010) Physics-based person tracking using the anthropomorphic walker. Int J Comput Vis

  5. Comaniciu D, Meer P (2002) Mean shift: a robust approach toward feature space analysis. IEEE Trans Pattern Anal Mach Intell 24(5)

  6. Criminisi A, Shotton J, Robertson D, Konukoglu E (2010) Regression forests for efficient anatomy detection and localization in CT studies. In: MICCAI workshop on medical computer vision: recognition techniques and applications in medical imaging, Beijing. Springer, Berlin

  7. Criminisi A, Shotton J, Konukoglu E (2012) Decision forests: a unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Found Trends Comput Graph Vis 7(2–3)

  8. Fergus R, Perona P, Zisserman A (2003) Object class recognition by unsupervised scale-invariant learning. In: Proc IEEE conf computer vision and pattern recognition (CVPR)

  9. Gall J, Lempitsky V (2009) Class-specific Hough forests for object detection. IEEE Trans Pattern Anal Mach Intell

  10. Ganapathi V, Plagemann C, Koller D, Thrun S (2010) Real time motion capture using a single time-of-flight camera. In: Proc IEEE conf computer vision and pattern recognition (CVPR). IEEE, New York

  11. Girshick R, Shotton J, Kohli P, Criminisi A, Fitzgibbon A (2011) Efficient regression of general-activity human poses from depth images. In: Proc IEEE intl conf on computer vision (ICCV)

  12. Grest D, Woetzel J, Koch R (2005) Nonlinear body pose estimation from depth images. In: Proc annual symposium of the German association for pattern recognition (DAGM)

  13. Hastie T, Tibshirani R, Friedman J, Franklin J (2005) The elements of statistical learning: data mining, inference and prediction. Math Intell 27(2)

  14. Knoop S, Vacek S, Dillmann R (2006) Sensor fusion for 3D human body tracking with an articulated 3D body model. In: Proc IEEE intl conf on robotics and automation (ICRA)

  15. Leibe B, Leonardis A, Schiele B (2008) Robust object detection with interleaved categorization and segmentation. Int J Comput Vis 77(1–3)

  16. Lepetit V, Lagger P, Fua P (2005) Randomized trees for real-time keypoint recognition. In: Proc IEEE conf computer vision and pattern recognition (CVPR)

  17. Microsoft Corporation Kinect for Windows and Xbox 360

  18. Müller J, Arens M (2010) Human pose estimation with implicit shape models. In: ARTEMIS

  19. Plagemann C, Ganapathi V, Koller D, Thrun S (2010) Real-time identification and localization of body parts from depth images. In: Proc IEEE intl conf on robotics and automation (ICRA)

  20. Sharp T (2008) Implementing decision trees and forests on a GPU. In: Proc European conf on computer vision (ECCV). Springer, Berlin

  21. Shotton J, Winn J, Rother C, Criminisi A (2006) TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In: Proc European conf on computer vision (ECCV). Springer, Berlin

  22. Shotton J, Fitzgibbon AW, Cook M, Sharp T, Finocchio M, Moore R, Kipman A, Blake A (2011) Real-time human pose recognition in parts from a single depth image. In: Proc IEEE conf computer vision and pattern recognition (CVPR)

  23. Shotton J, Girshick R, Fitzgibbon A, Sharp T, Cook M, Finocchio M, Moore R, Kohli P, Criminisi A, Kipman A, Blake A (2012) Efficient human pose estimation from single depth images. IEEE Trans Pattern Anal Mach Intell

  24. Siddiqui M, Medioni G (2010) Human pose estimation from a single view point, real-time range sensor. In: CVCG at CVPR

  25. Sigal L, Bhatia S, Roth S, Black MJ, Isard M (2004) Tracking loose-limbed people. In: Proc IEEE conf computer vision and pattern recognition (CVPR)

  26. Urtasun R, Darrell T (2008) Local probabilistic regression for activity-independent human pose inference. In: Proc IEEE conf computer vision and pattern recognition (CVPR)

  27. Vitter JS (1985) Random sampling with a reservoir. ACM Trans Math Softw 11(1)

  28. Wang RY, Popović J (2009) Real-time hand-tracking with a color glove. In: Proc ACM SIGGRAPH

  29. Winn J, Shotton J (2006) The layout consistent random field for recognizing and segmenting partially occluded objects. In: Proc IEEE conf computer vision and pattern recognition (CVPR)

  30. Zhu Y, Fujimura K (2007) Constrained optimization for human pose estimation from depth sequences. In: Proc Asian conf on computer vision (ACCV)

Copyright information

© 2013 Springer-Verlag London

About this chapter

Cite this chapter

Shotton, J. et al. (2013). Efficient Human Pose Estimation from Single Depth Images. In: Criminisi, A., Shotton, J. (eds) Decision Forests for Computer Vision and Medical Image Analysis. Advances in Computer Vision and Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-4471-4929-3_13

  • DOI: https://doi.org/10.1007/978-1-4471-4929-3_13

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-4928-6

  • Online ISBN: 978-1-4471-4929-3

  • eBook Packages: Computer Science, Computer Science (R0)
