Deep Recurrent Policy Networks for Planning Under Partial Observability

  • Conference paper
  • In: Artificial Neural Networks and Machine Learning – ICANN 2019: Theoretical Neural Computation (ICANN 2019)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11727)

Abstract

QMDP-net is a recurrent network architecture that combines the strengths of model-free learning and model-based planning for planning under partial observability. The architecture represents a policy by connecting a partially observable Markov decision process (POMDP) model with the QMDP algorithm, which uses value iteration to approximately solve the POMDP model. However, because the value iteration used in QMDP sweeps the entire state space, it may suffer from the “curse of dimensionality”. Moreover, because QMDP-based policies never take actions purely to gain information, they can perform poorly in domains where information gathering is necessary. To address these two issues, this paper introduces two deep recurrent policy networks, asynchronous QMDP-net and ReplicatedQ-net, built on the plain QMDP-net. The former incorporates asynchronous updates into the QMDP value iteration process and learns a smaller abstract state-space representation for planning. The latter partially replaces QMDP with the replicated Q-learning algorithm so that informative actions can be taken. Experimental results demonstrate that the proposed networks outperform the plain QMDP-net on simulated robotic tasks.
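To make the two algorithmic ingredients named above concrete, the following is a minimal tabular sketch, in Python with NumPy, of the QMDP rule and of a belief-weighted replicated Q-learning update. The array shapes, function names, learning rate, discount factor, and fixed iteration count are illustrative assumptions for a small discrete POMDP; this is not the paper's network implementation.

```python
import numpy as np


def qmdp_q_values(T, R, gamma=0.95, n_iters=100):
    """Value iteration on the underlying MDP (the planning step QMDP relies on).

    T: transition tensor of shape (|A|, |S|, |S|), T[a, s, s2] = P(s2 | s, a)
    R: reward matrix of shape (|S|, |A|)
    Every sweep touches the whole state space, which is where the
    "curse of dimensionality" mentioned in the abstract comes from.
    """
    n_actions, n_states, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        V = Q.max(axis=1)                       # V(s) = max_a Q(s, a)
        for a in range(n_actions):
            Q[:, a] = R[:, a] + gamma * T[a] @ V
    return Q


def qmdp_action(belief, Q):
    """QMDP action selection: argmax_a sum_s b(s) Q(s, a).

    The Q-values assume full observability after one step, so the chosen
    action never pays a cost purely to gather information.
    """
    return int(np.argmax(belief @ Q))


def replicated_q_update(Q, belief, action, reward, next_belief,
                        alpha=0.1, gamma=0.95):
    """One belief-weighted replicated Q-learning update (illustrative sketch).

    Each state's Q-value for the taken action is nudged toward the same
    temporal-difference target, weighted by that state's belief probability.
    """
    target = reward + gamma * np.max(next_belief @ Q)
    Q[:, action] += alpha * belief * (target - Q[:, action])
    return Q
```

Because qmdp_q_values sweeps every state in every iteration, its cost grows with the size of the state space, which is the scalability issue the asynchronous variant targets by updating only part of a smaller abstract state space per sweep. And because qmdp_action never trades immediate reward for information, information-gathering behavior is the limitation that ReplicatedQ-net targets by partially replacing QMDP with the replicated Q-learning update.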

This work is supported in part by the National Natural Science Foundation of China under Grant Nos. 61876119 and 61502323, and by the Natural Science Foundation of Jiangsu under Grant No. BK20181432.

Author information

Corresponding author

Correspondence to Zixuan Chen.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Chen, Z., Zhang, Z. (2019). Deep Recurrent Policy Networks for Planning Under Partial Observability. In: Tetko, I., Kůrková, V., Karpov, P., Theis, F. (eds) Artificial Neural Networks and Machine Learning – ICANN 2019: Theoretical Neural Computation. ICANN 2019. Lecture Notes in Computer Science, vol 11727. Springer, Cham. https://doi.org/10.1007/978-3-030-30487-4_46

  • DOI: https://doi.org/10.1007/978-3-030-30487-4_46

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-30486-7

  • Online ISBN: 978-3-030-30487-4

  • eBook Packages: Computer Science, Computer Science (R0)
