Abstract
Monte Carlo tree search (MCTS) has recently attracted great interest in the domain of planning and learning under uncertainty. One of its fundamental challenges is the trade-off between exploration and exploitation. To address this problem, we propose to balance exploration and exploitation via posterior sampling in the contexts of Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). Specifically, we treat the cumulative reward returned by taking an action from a node in the MCTS search tree as a random variable following an unknown distribution. We parametrize this distribution by introducing the necessary hidden parameters and infer their posterior distribution in a Bayesian way. We then expand a node in the search tree by using Thompson sampling to select an action according to its posterior probability of being optimal. Following this idea, we develop the Dirichlet-NormalGamma based Monte Carlo tree search (DNG-MCTS) and Dirichlet-Dirichlet-NormalGamma based partially observable Monte Carlo planning (D2NG-POMCP) algorithms for Monte Carlo planning in MDPs and POMDPs, respectively. Experimental results show that the proposed algorithms outperform state-of-the-art approaches on several benchmark problems.
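To make the core idea concrete, here is a minimal sketch of Thompson-sampling action selection with conjugate NormalGamma posteriors, under the assumption that the return of each action is modeled as a Normal variable with unknown mean and precision. The names `NormalGamma` and `thompson_select` are illustrative, not from the paper, and the sketch omits the Dirichlet components that DNG-MCTS and D2NG-POMCP additionally maintain over transition and observation outcomes.

```python
import math
import random

class NormalGamma:
    """Conjugate NormalGamma posterior over the (mean, precision) of a
    Normal return distribution, with standard hyperparameter updates."""
    def __init__(self, mu=0.0, kappa=1.0, alpha=1.0, beta=1.0):
        self.mu, self.kappa, self.alpha, self.beta = mu, kappa, alpha, beta

    def update(self, x):
        # Standard conjugate update for a single observed return x;
        # beta must be updated with the *prior* mu and kappa.
        self.beta += self.kappa * (x - self.mu) ** 2 / (2.0 * (self.kappa + 1.0))
        self.mu = (self.kappa * self.mu + x) / (self.kappa + 1.0)
        self.kappa += 1.0
        self.alpha += 0.5

    def sample_mean(self):
        # Draw a precision tau ~ Gamma(alpha, rate=beta), then a mean
        # ~ Normal(mu, 1/(kappa*tau)): one posterior sample of the mean.
        tau = random.gammavariate(self.alpha, 1.0 / self.beta)
        return random.gauss(self.mu, 1.0 / math.sqrt(self.kappa * tau))

def thompson_select(posteriors):
    """Pick the action whose sampled posterior mean return is largest."""
    samples = {a: ng.sample_mean() for a, ng in posteriors.items()}
    return max(samples, key=samples.get)

# Usage: one Thompson-sampling step at a search node with two actions.
node = {"left": NormalGamma(), "right": NormalGamma()}
node["right"].update(1.0)          # back up a simulated return of 1.0
action = thompson_select(node)     # action with the highest sampled mean
```

In a full MCTS loop, every simulated return backed up through a node would call `update` on that action's posterior, and `thompson_select` would take the place of a UCB1-style action choice at each tree node.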
Acknowledgements
Feng Wu was supported in part by the National Natural Science Foundation of China under grant No. 61603368, the Youth Innovation Promotion Association of CAS (No. 2015373), and the Natural Science Foundation of Anhui Province under grant No. 1608085QF134. Aijun Bai was supported in part by the National Research Foundation for the Doctoral Program of China under grant 20133402110026, the National Hi-Tech Project of China under grant 2008AA01Z150, and the Natural Science Foundation of China under grants 60745002 and 61175057. We are grateful to the reviewers for their constructive comments and suggestions.
Cite this article
Bai, A., Wu, F. & Chen, X. Posterior sampling for Monte Carlo planning under uncertainty. Appl Intell 48, 4998–5018 (2018). https://doi.org/10.1007/s10489-018-1248-5