Abstract
This paper presents a new perspective on 3D scene understanding: reasoning about object stability and safety using intuitive mechanics. Our approach builds on a simple observation that, by human design, objects in static scenes should be stable in the gravity field and safe with respect to various physical disturbances such as human activities. This assumption applies to all scene categories and imposes useful constraints on the plausible interpretations (parses) in scene understanding. Given a 3D point cloud of a static scene captured by depth cameras, our method consists of three steps: (i) recovering solid 3D volumetric primitives from voxels; (ii) reasoning about stability by grouping unstable primitives into physically stable objects, optimizing both the stability and the scene prior; and (iii) reasoning about safety by evaluating the physical risks to objects under physical disturbances such as human activity, wind, or earthquakes. We adopt a novel intuitive physics model and represent the energy landscape of each primitive and object in the scene by a disconnectivity graph (DG). We construct a contact graph whose nodes are 3D volumetric primitives and whose edges represent supporting relations. We then use the Swendsen–Wang Cuts algorithm to partition the contact graph into groups, each of which is a stable object. To detect unsafe objects in a static scene, our method further infers hidden and situated causes (disturbances) in the scene, and then applies intuitive physical mechanics to predict possible effects (e.g., falls) as consequences of the disturbances. In experiments, we demonstrate that the algorithm performs substantially better than other state-of-the-art methods in (i) object segmentation, (ii) 3D volumetric recovery, and (iii) scene understanding. We also compare the safety predictions of the intuitive mechanics model with human judgement.
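The contact-graph partitioning mentioned in the abstract can be illustrated with a small sketch. It shows only the clustering half of a Swendsen–Wang-style move: each edge is switched on with some probability and the connected components of the surviving edges become candidate groups. The primitive names, the edge probabilities, and the function `sample_clusters` are hypothetical, chosen for illustration. The stability energy, the scene prior, and the accept/reject step of the actual method are not described in the abstract and are omitted here.

```python
import random


def sample_clusters(nodes, edges, rng):
    """Switch each contact edge on with its probability q, then return
    the connected components formed by the edges that stayed on.

    edges: list of (u, v, q) with q in [0, 1] (hypothetical values).
    """
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v, q in edges:
        if rng.random() < q:  # edge stays on with probability q
            parent[find(u)] = find(v)

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


# Toy contact graph: nodes are volumetric primitives, edges are contacts.
primitives = ["table_top", "leg_a", "leg_b", "cup"]
contacts = [
    ("table_top", "leg_a", 1.0),  # always grouped in this toy setup
    ("table_top", "leg_b", 1.0),
    ("cup", "table_top", 0.0),    # never grouped in this toy setup
]
clusters = sample_clusters(primitives, contacts, random.Random(0))
print(clusters)
```

In a full sampler, one of these clusters would be picked, assigned a new group label, and the move accepted or rejected according to the posterior. That part depends on the model details given in the paper body, not in the abstract.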
Acknowledgments
This work is supported by (1) MURI ONR N00014-10-1-0933 and DARPA MSEE grant FA 8650-11-1-7149, USA; (2) Next-generation Energies for Tohoku Recovery (NET) and the SCOPE Program of the Ministry of Internal Affairs and Communications, Japan; and (3) the 10-th core Project Grant of Microsoft Japan.
Additional information
Communicated by Derek Hoiem, James Hays, Jianxiong Xiao, and Aditya Khosla.
About this article
Cite this article
Zheng, B., Zhao, Y., Yu, J. et al. Scene Understanding by Reasoning Stability and Safety. Int J Comput Vis 112, 221–238 (2015). https://doi.org/10.1007/s11263-014-0795-4