Reinforcement Learning and Markov Decision Processes

Chapter in: Reinforcement Learning

Part of the book series: Adaptation, Learning, and Optimization (ALO, volume 12)

Abstract

Situated between supervised and unsupervised learning, the paradigm of reinforcement learning deals with learning in sequential decision-making problems with limited feedback. This text introduces the intuitions and concepts behind Markov decision processes and two classes of algorithms for computing optimal behaviors: reinforcement learning and dynamic programming. First, the formal framework of Markov decision processes is defined, together with the definitions of value functions and policies. The main part of the text introduces the foundational classes of algorithms for learning optimal behaviors, based on various definitions of optimality for sequential decision making. It then surveys efficient extensions of these foundational algorithms, which differ mainly in how the feedback given by the environment is used to speed up learning and in how they concentrate computation on the relevant parts of the problem. In both model-based and model-free settings, these efficient extensions have proven useful for scaling up to larger problems.
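To make these notions concrete, the sketch below shows value iteration, one of the foundational dynamic-programming algorithms in the class the abstract describes, applied to a tiny hand-crafted MDP. The transition model, rewards, discount factor, and threshold are illustrative assumptions, not taken from the chapter.

    # Minimal value-iteration sketch for a toy MDP (illustrative assumptions, not the chapter's code).
    # States 0..2, actions 'a'/'b'; P[s][a] lists (probability, next_state, reward) triples.
    P = {
        0: {'a': [(0.8, 1, 0.0), (0.2, 0, 0.0)], 'b': [(1.0, 0, 0.0)]},
        1: {'a': [(1.0, 2, 1.0)], 'b': [(1.0, 0, 0.0)]},
        2: {'a': [(1.0, 2, 0.0)], 'b': [(1.0, 2, 0.0)]},   # absorbing goal state
    }
    gamma, theta = 0.9, 1e-6        # discount factor and convergence threshold (assumed values)
    V = {s: 0.0 for s in P}         # value function, initialised to zero

    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: V(s) <- max_a sum_s' P(s'|s,a) * (r + gamma * V(s'))
            best = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:           # stop when the largest update is negligible
            break

    # Extract a greedy policy from the converged value function.
    policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
              for s in P}

A model-free method such as Q-learning would replace the explicit model P with sampled updates from interaction with the environment, which is the distinction between the two algorithm classes the abstract contrasts.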

Author information

Correspondence to Martijn van Otterlo.

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

van Otterlo, M., Wiering, M. (2012). Reinforcement Learning and Markov Decision Processes. In: Wiering, M., van Otterlo, M. (eds) Reinforcement Learning. Adaptation, Learning, and Optimization, vol 12. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27645-3_1

  • DOI: https://doi.org/10.1007/978-3-642-27645-3_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-27644-6

  • Online ISBN: 978-3-642-27645-3

  • eBook Packages: Engineering (R0)
