Abstract
We consider the active learning problem of inferring the transition model of a Markov Decision Process by acting and observing transitions. This is particularly useful when no reward function is a priori defined. Our proposal is to cast the active learning task as a utility maximization problem using Bayesian reinforcement learning with belief-dependent rewards. After presenting three possible performance criteria, we derive from them the belief-dependent rewards to be used in the decision-making process. As computing the optimal Bayesian value function is intractable for large horizons, we use a simple algorithm to approximately solve this optimization problem. Despite the sub-optimality of this technique, we show experimentally that our proposal is efficient in a number of domains.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Araya-López, M., Buffet, O., Thomas, V., Charpillet, F.: A POMDP extension with belief-dependent rewards. In: Advances in Neural Information Processing Systems 23 (NIPS 2010) (2010)
Asmuth, J., Li, L., Littman, M., Nouri, A., Wingate, D.: A Bayesian sampling approach to exploration in reinforcement learning. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI 2009) (2009)
Bellman, R.: The theory of dynamic programming. Bull. Amer. Math. Soc. 60, 503–516 (1954)
Brafman, R., Tennenholtz, M.: R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3, 213–231 (2003)
Şimşek, O., Barto, A.G.: An intrinsic reward mechanism for efficient exploration. In: Proceedings of the 23rd International Conference on Machine Learning, ICML 2006, pp. 833–840. ACM, New York (2006)
Dimitrakakis, C.: Tree exploration for Bayesian RL exploration. In: CIMCA/IAWTIC/ISE, pp. 1029–1034 (2008)
Duff, M.: Optimal learning: Computational procedures for Bayes-adaptive Markov decision processes. Ph.D. thesis, University of Massachusetts Amherst (2002)
Gittins, J.C.: Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society 41(2), 148–177 (1979)
Jonsson, A., Barto, A.G.: Active Learning of Dynamic Bayesian Networks in Markov Decision Processes. In: Miguel, I., Ruml, W. (eds.) SARA 2007. LNCS (LNAI), vol. 4612, pp. 273–284. Springer, Heidelberg (2007)
Kolter, J., Ng, A.: Near-Bayesian exploration in polynomial time. In: Proceedings of the Twenty-Sixth International Conference on Machine Learning, ICML 2009 (2009)
Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Proceedings of the Sixteenth International Conference on Machine Learning, pp. 278–287. Morgan Kaufmann (1999)
Poupart, P., Vlassis, N., Hoey, J., Regan, K.: An analytic solution to discrete Bayesian reinforcement learning. In: Proceedings of the Twenty-Third International Conference on Machine Learning (ICML 2006) (2006)
Puterman, M.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience (April 1994)
Rauber, T., Braun, T., Berns, K.: Probabilistic distance measures of the Dirichlet and Beta distributions. Pattern Recognition 41(2), 637–645 (2008)
Roy, N., Thrun, S.: Coastal navigation with mobile robots. In: Advances in Neural Information Processing Systems 12, pp. 1043–1049 (1999)
Sorg, J., Singh, S., Lewis, R.: Variance-based rewards for approximate Bayesian reinforcement learning. In: Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (2010)
Strens, M.J.A.: A Bayesian framework for reinforcement learning. In: Proceedings of the International Conference on Machine Learning (ICML 2000), pp. 943–950 (2000)
Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. MIT Press (1998)
Szepesvári, C.: Reinforcement learning algorithms for MDPs – a survey. Tech. Rep. TR09-13, Department of Computing Science, University of Alberta (2009)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Araya-López, M., Buffet, O., Thomas, V., Charpillet, F. (2012). Active Learning of MDP Models. In: Sanner, S., Hutter, M. (eds) Recent Advances in Reinforcement Learning. EWRL 2011. Lecture Notes in Computer Science(), vol 7188. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29946-9_8
Download citation
DOI: https://doi.org/10.1007/978-3-642-29946-9_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29945-2
Online ISBN: 978-3-642-29946-9
eBook Packages: Computer ScienceComputer Science (R0)