Scene Understanding by Reasoning Stability and Safety

Zheng, Bo; Zhao, Yibiao; Yu, Joey; Ikeuchi, Katsushi; Zhu, Song-Chun

doi:10.1007/s11263-014-0795-4

Published: 28 January 2015

Volume 112, pages 221–238, (2015)

International Journal of Computer Vision

Bo Zheng¹,
Yibiao Zhao²,
Joey Yu²,
Katsushi Ikeuchi¹ &
Song-Chun Zhu²

Abstract

This paper presents a new perspective for 3D scene understanding by reasoning about object stability and safety using intuitive mechanics. Our approach utilizes a simple observation that, by human design, objects in static scenes should be stable in the gravity field and be safe with respect to various physical disturbances such as human activities. This assumption is applicable to all scene categories and poses useful constraints for the plausible interpretations (parses) in scene understanding. Given a 3D point cloud captured for a static scene by depth cameras, our method consists of three steps: (i) recovering solid 3D volumetric primitives from voxels; (ii) reasoning stability by grouping the unstable primitives into physically stable objects by optimizing the stability and the scene prior; and (iii) reasoning safety by evaluating the physical risks for objects under physical disturbances, such as human activity, wind or earthquakes. We adopt a novel intuitive physics model and represent the energy landscape of each primitive and object in the scene by a disconnectivity graph (DG). We construct a contact graph with nodes being 3D volumetric primitives and edges representing the supporting relations. Then we adopt a Swendsen–Wang Cuts algorithm to partition the contact graph into groups, each of which is a stable object. In order to detect unsafe objects in a static scene, our method further infers hidden and situated causes (disturbances) in the scene, and then introduces intuitive physical mechanics to predict possible effects (e.g., falls) as consequences of the disturbances. In experiments, we demonstrate that the algorithm achieves substantially better performance for (i) object segmentation, (ii) 3D volumetric recovery, and (iii) scene understanding with respect to other state-of-the-art methods. We also compare the safety prediction from the intuitive mechanics model with human judgement.
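The contact graph described in the abstract can be illustrated with a toy sketch. The code below is not the authors' implementation: it assumes axis-aligned box primitives, treats a box as resting on another when its bottom face meets the other's top face and their footprints overlap, and uses a crude stability test (the centre of mass must project inside the bounding rectangle of the contact patches) in place of the paper's energy-landscape model. The names `Box`, `contact_graph`, and `is_supported` are hypothetical.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned volumetric primitive."""
    name: str
    lo: tuple  # (x, y, z) minimum corner
    hi: tuple  # (x, y, z) maximum corner


def footprint_overlap(a, b):
    """Return the xy rectangle shared by a and b, or None if there is none."""
    x0, x1 = max(a.lo[0], b.lo[0]), min(a.hi[0], b.hi[0])
    y0, y1 = max(a.lo[1], b.lo[1]), min(a.hi[1], b.hi[1])
    if x0 < x1 and y0 < y1:
        return (x0, y0, x1, y1)
    return None


def contact_graph(boxes, tol=1e-6):
    """Map (upper, lower) name pairs to the contact patch where upper rests on lower."""
    edges = {}
    for upper, lower in permutations(boxes, 2):
        # A supporting relation needs touching faces and overlapping footprints.
        if abs(upper.lo[2] - lower.hi[2]) <= tol:
            rect = footprint_overlap(upper, lower)
            if rect is not None:
                edges[(upper.name, lower.name)] = rect
    return edges


def is_supported(box, edges, ground_z=0.0, tol=1e-6):
    """Toy stability test (an assumption, not the paper's model)."""
    if abs(box.lo[2] - ground_z) <= tol:
        return True  # resting directly on the ground plane
    rects = [r for (upper, _), r in edges.items() if upper == box.name]
    if not rects:
        return False  # floating primitive: nothing underneath
    # Bounding rectangle of all contact patches approximates the support region.
    x0 = min(r[0] for r in rects)
    y0 = min(r[1] for r in rects)
    x1 = max(r[2] for r in rects)
    y1 = max(r[3] for r in rects)
    cx = (box.lo[0] + box.hi[0]) / 2
    cy = (box.lo[1] + box.hi[1]) / 2
    return x0 <= cx <= x1 and y0 <= cy <= y1


# Illustrative scene: a book fully on a table, a vase hanging over the edge.
table = Box("table", (0.0, 0.0, 0.0), (2.0, 2.0, 1.0))
book = Box("book", (0.5, 0.5, 1.0), (1.5, 1.5, 1.2))
vase = Box("vase", (1.8, 0.5, 1.0), (3.0, 1.5, 1.5))
edges = contact_graph([table, book, vase])
unstable = [b.name for b in (table, book, vase) if not is_supported(b, edges)]
```

In this toy scene the vase's centre of mass lies beyond the table edge, so it is the only primitive flagged. In the approach the abstract describes, such unstable primitives would instead be candidates for grouping with their neighbours into a stable object, which this sketch does not attempt.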

Acknowledgments

This work is supported by (1) MURI ONR N00014-10-1-0933 and DARPA MSEE grant FA 8650-11-1-7149, USA, (2) Next-generation Energies for Tohoku Recovery (NET) and SCOPE Program of Ministry of Internal Affairs and Communications, Japan, and (3) the 10th core Project Grant of Microsoft Japan.

Author information

Authors and Affiliations

The University of Tokyo, Tokyo, Japan
Bo Zheng & Katsushi Ikeuchi
University of California, Los Angeles (UCLA), Los Angeles, USA
Yibiao Zhao, Joey Yu & Song-Chun Zhu

Corresponding author

Correspondence to Bo Zheng.

Additional information

Communicated by Derek Hoiem, James Hays, Jianxiong Xiao, and Aditya Khosla.

About this article

Cite this article

Zheng, B., Zhao, Y., Yu, J. et al. Scene Understanding by Reasoning Stability and Safety. Int J Comput Vis 112, 221–238 (2015). https://doi.org/10.1007/s11263-014-0795-4

Received: 09 February 2014
Accepted: 15 December 2014
Published: 28 January 2015
Issue Date: April 2015
DOI: https://doi.org/10.1007/s11263-014-0795-4
