Deep Recurrent Policy Networks for Planning Under Partial Observability

  • Conference paper
  • In: Artificial Neural Networks and Machine Learning – ICANN 2019: Theoretical Neural Computation (ICANN 2019)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11727)

Abstract

QMDP-net is a recurrent network architecture that combines the strengths of model-free learning and model-based planning for planning under partial observability. The architecture represents a policy by connecting a partially observable Markov decision process (POMDP) model with the QMDP algorithm, which uses value iteration to approximately solve the POMDP model. However, because the value iteration used in QMDP sweeps the entire state space, it may suffer from the “curse of dimensionality”. Moreover, because QMDP-based policies never take actions purely to gain information, they can perform poorly in domains where information gathering is necessary. To address these two issues, this paper introduces two deep recurrent policy networks, asynchronous QMDP-net and ReplicatedQ-net, built on the plain QMDP-net. The former incorporates asynchronous updates into the QMDP value iteration process and learns a smaller abstract state-space representation for planning. The latter partially replaces QMDP with the replicated Q-learning algorithm so that informative actions can be taken. Experimental results demonstrate that the proposed networks outperform the plain QMDP-net on simulated robotic tasks.
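To make the two algorithmic ingredients named above concrete, the following is a minimal tabular sketch, in Python with NumPy, of the QMDP rule and of a belief-weighted replicated Q-learning update. The array shapes, function names, learning rate, discount factor, and fixed iteration count are illustrative assumptions for a small discrete POMDP; this is not the paper's network implementation.

```python
import numpy as np


def qmdp_q_values(T, R, gamma=0.95, n_iters=100):
    """Value iteration on the underlying MDP (the planning step QMDP relies on).

    T: transition tensor of shape (|A|, |S|, |S|), T[a, s, s2] = P(s2 | s, a)
    R: reward matrix of shape (|S|, |A|)
    Every sweep touches the whole state space, which is where the
    "curse of dimensionality" mentioned in the abstract comes from.
    """
    n_actions, n_states, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        V = Q.max(axis=1)                       # V(s) = max_a Q(s, a)
        for a in range(n_actions):
            Q[:, a] = R[:, a] + gamma * T[a] @ V
    return Q


def qmdp_action(belief, Q):
    """QMDP action selection: argmax_a sum_s b(s) Q(s, a).

    The Q-values assume full observability after one step, so the chosen
    action never pays a cost purely to gather information.
    """
    return int(np.argmax(belief @ Q))


def replicated_q_update(Q, belief, action, reward, next_belief,
                        alpha=0.1, gamma=0.95):
    """One belief-weighted replicated Q-learning update (illustrative sketch).

    Each state's Q-value for the taken action is nudged toward the same
    temporal-difference target, weighted by that state's belief probability.
    """
    target = reward + gamma * np.max(next_belief @ Q)
    Q[:, action] += alpha * belief * (target - Q[:, action])
    return Q
```

Because qmdp_q_values sweeps every state in every iteration, its cost grows with the size of the state space, which is the scalability issue the asynchronous variant targets by updating only part of a smaller abstract state space per sweep. And because qmdp_action never trades immediate reward for information, information-gathering behavior is the limitation that ReplicatedQ-net targets by partially replacing QMDP with the replicated Q-learning update.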

This work is supported in part by the National Natural Science Foundation of China under Grant Nos. 61876119 and 61502323, and by the Natural Science Foundation of Jiangsu under Grant No. BK20181432.

Author information

Corresponding author

Correspondence to Zixuan Chen.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Chen, Z., Zhang, Z. (2019). Deep Recurrent Policy Networks for Planning Under Partial Observability. In: Tetko, I., Kůrková, V., Karpov, P., Theis, F. (eds) Artificial Neural Networks and Machine Learning – ICANN 2019: Theoretical Neural Computation. ICANN 2019. Lecture Notes in Computer Science, vol 11727. Springer, Cham. https://doi.org/10.1007/978-3-030-30487-4_46

  • DOI: https://doi.org/10.1007/978-3-030-30487-4_46

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-30486-7

  • Online ISBN: 978-3-030-30487-4

  • eBook Packages: Computer Science, Computer Science (R0)
