Abstract
In this paper we address the problem of simultaneous learning and coordination in multiagent Markov decision problems (MMDPs) with infinite state-spaces. We separate this problem into two distinct subproblems: learning and coordination. To tackle the problem of learning, we survey Q-learning with soft-state aggregation (Q-SSA), a well-known method from the reinforcement learning literature (Singh et al. in Advances in neural information processing systems. MIT Press, Cambridge, vol 7, pp 361–368, 1994). Q-SSA allows the agents in the game to approximate the optimal Q-function, from which the optimal policies can be computed. We establish the convergence of Q-SSA and introduce a new result describing the rate of convergence of this method. In tackling the problem of coordination, we start by pointing out that knowledge of the optimal Q-function is not enough to ensure that all agents adopt a jointly optimal policy. We propose a novel coordination mechanism that, given knowledge of the optimal Q-function for an MMDP, ensures that all agents converge to a jointly optimal policy in every relevant state of the game. This coordination mechanism, approximate biased adaptive play (ABAP), extends biased adaptive play (Wang and Sandholm in Advances in neural information processing systems. MIT Press, Cambridge, vol 15, pp 1571–1578, 2003) to MMDPs with infinite state-spaces. Finally, we combine Q-SSA with ABAP, leading to a novel algorithm in which learning of the game and coordination take place simultaneously. We discuss several important properties of this new algorithm and establish its convergence with probability 1. We also provide simple illustrative examples of its application.
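As a rough illustration of the soft-state-aggregation idea the abstract builds on, the sketch below shows one Q-SSA update in a tabular setting: each state is softly assigned to a small set of clusters, the Q-function is stored per cluster, and a sampled cluster is moved toward the TD target. All names, the dictionary layout, and the two-cluster setup are illustrative assumptions, not the paper's implementation.

```python
import random

def qssa_update(Q, p, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One Q-SSA step (sketch). Q maps (cluster, action) to a value;
    p[s] lists the soft-aggregation weights P(x | s) over clusters."""
    # Approximate Q(state, b) by the aggregation-weighted average over clusters.
    def q_hat(state, b):
        return sum(w * Q[(x, b)] for x, w in enumerate(p[state]))

    actions = sorted({b for (_, b) in Q})
    # TD target uses the aggregated estimate at the successor state.
    target = r + gamma * max(q_hat(s_next, b) for b in actions)
    # Sample a cluster for the current state and move its value toward the target.
    x = random.choices(range(len(p[s])), weights=p[s])[0]
    Q[(x, a)] += alpha * (target - Q[(x, a)])
    return Q
```

With all values initialized to zero, a single update with reward 1 shifts exactly one cluster's value for the taken action by `alpha * 1`, while the other action's values stay untouched.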
References
Bernstein D. S., Zilberstein S., Immerman N. (2002) The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research 27(4): 819–840
Bertsekas D. P., Tsitsiklis J. N. (1996) Neuro-dynamic programming. Optimization and neural computation series. Athena Scientific, Belmont, MA
Boutilier, C. (1999). Sequential optimality and coordination in multiagent systems. In Proceedings of the 16th international joint conference on artificial intelligence (IJCAI’99) (pp. 478–485).
Boutilier, C. (1996). Planning, learning and coordination in multiagent decision processes. In Proceedings of the 6th conference on theoretical aspects of rationality and knowledge (TARK-96) (pp. 195–210)
Bowling, M. (2000). Convergence problems of general-sum multiagent reinforcement learning. In Proceedings of the 17th international conference on machine learning (ICML’00) (pp. 89–94). Morgan Kaufmann.
Bowling, M., & Veloso, M. (2000a). An analysis of stochastic game theory for multiagent reinforcement learning. Technical Report CMU-CS-00-165, School of Computer Science, Carnegie Mellon University.
Bowling, M., & Veloso, M. (2000b). Scalable learning in stochastic games. In Proceedings of the AAAI workshop on game theoretic and decision theoretic agents (GTDT’02) (pp. 11–18). The AAAI Press, Published as AAAI Technical Report WS-02-06.
Bowling, M., & Veloso, M. (2001). Rational and convergent learning in stochastic games. In Proceedings of the 17th international joint conference on artificial intelligence (IJCAI’01) (pp. 1021–1026).
Bowling M., Veloso M. (2002) Multi-agent learning using a variable learning rate. Artificial Intelligence 136: 215–250
Brown G. W. (1949) Some notes on computation of games solutions. Research Memoranda RM-125-PR. RAND Corporation, Santa Monica
Claus, C., & Boutilier, C. (1998). The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the 15th national conference on artificial intelligence (AAAI’98) (pp. 746–752).
Crites R. H., Barto A. G. (1998) Elevator group control using multiple reinforcement learning agents. Machine Learning 33(2–3): 235–262
Duflo, M. (1997). Random iterative models. In Applications of mathematics (Vol. 34). Springer.
Durfee E. H., Lesser V. R., Corkill D. D. (1987) Coherent cooperation among communicating problem solvers. IEEE Transactions on Computers 36(11): 1275–1291
Even-Dar E., Mansour Y. (2003) Learning rates for Q-learning. Journal of Machine Learning Research 5: 1–25
Gmytrasiewicz P., Doshi P. (2005) A framework for sequential planning in multiagent settings. Journal of Artificial Intelligence Research 24: 49–79
Gordon, G. J. (1995). Stable function approximation in dynamic programming. Technical Report CMU-CS-95-103, School of Computer Science, Carnegie Mellon University.
Guestrin, C., Lagoudakis, M. G., & Parr, R. (2002). Coordinated reinforcement learning. In Proceedings of the 19th international conference on machine learning (ICML’02) (pp. 227–234).
Hu J., Wellman M. P. (2003) Nash Q-learning for general sum stochastic games. Journal of Machine Learning Research 4: 1039–1069
Kearns, M., & Singh, S. (1999). Finite-sample convergence rates for Q-learning and indirect algorithms. In M. J. Kearns, S. A. Solla, & D. A. Cohn, (Eds.), Advances in neural information processing systems (Vol. 11, pp. 996–1002). Cambridge, MA: MIT Press.
Kok J. R., Spaan, M. T. J., & Vlassis, N. (2002). An approach to noncommunicative multiagent coordination in continuous domains. In: M. Wiering, (Ed.), Benelearn 2002: Proceedings of the 12th Belgian–Dutch conference on machine learning (pp. 46–52). Utrecht, The Netherlands.
Leslie D. S., Collins E. J. (2006) Generalised weakened fictitious play. Games and Economic Behavior 56: 285–298
Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In R. López de Mántaras, & D. Poole (Eds.), Proceedings of the 11th international conference on machine learning (ICML’94) (pp. 157–163). San Francisco, CA: Morgan Kaufmann.
Littman M. L. (2001) Value-function reinforcement learning in Markov games. Journal of Cognitive Systems Research 2(1): 55–66
Littman, M. L. (2001b). Friend-or-foe Q-learning in general-sum games. In Proceedings of the 18th international conference on machine learning (ICML’01) (pp. 322–328). San Francisco, CA: Morgan Kaufmann.
Melo, F. S., & Ribeiro, M. I. (2007a). Rational and convergent model-free adaptive learning for team Markov games. Technical Report RT-601-07, Institute for Systems and Robotics, February.
Melo, F. S., & Ribeiro, M. I. (2007b). Learning to coordinate in topological navigation tasks. In Proceedings of the 6th IFAC symposium on intelligent autonomous vehicles (IAV’07) (to appear), September.
Melo, F. S., & Ribeiro, M. I. (2008). Emerging coordination in infinite team Markov games. In Proceedings of the 7th international conference on autonomous agents and multiagent systems (AAMAS’08) (pp. 355–362).
Melo, F. S., & Veloso, M. (2009). Learning of coordination: Exploiting sparse interactions in multiagent systems. In Proceedings of the 8th international conference on autonomous agents and multiagent systems (AAMAS’09) (pp. 773–780).
Melo, F. S., Meyn, S. P., & Ribeiro, M. I. (2008). An analysis of reinforcement learning with function approximation. In Proceedings of the 25th international conference on machine learning (ICML’08) (pp. 664–671).
Meyn, S. P., & Tweedie, R. L. (1993). Markov chains and stochastic stability. Communications and Control Engineering Series. New York: Springer.
Nash J. F. (1950) Equilibrium points in n-person games. Proceedings of the National Academy of Sciences 36: 48–49
Ormoneit D., Sen Ś. (2002) Kernel-based reinforcement learning. Machine Learning 49: 161–178
Pelletier M. (1998) On the almost sure asymptotic behaviour of stochastic algorithms. Stochastic Processes and their Applications 78: 217–244
Perkins T. J., Precup D. (2003) A convergent form of approximate policy iteration. In: Thrun S., Becker S., Obermayer K. (eds) Advances in neural information processing systems. MIT Press, Cambridge, MA, pp 1595–1602
Robinson J. (1951) An iterative method of solving a game. Annals of Mathematics 54: 296–301
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210–229. Reprinted in IBM Journal of Research and Development, 44(1/2), 206–226, 2000.
Samuel A. L. (1967) Some studies in machine learning using the game of checkers II: Recent progress. IBM Journal of Research and Development 11: 601–617
Sen S., Weiß G. (1999) Learning in multiagent systems, chapter 6. MIT Press, Cambridge, MA, pp 259–298
Singh, S. P., Jaakkola, T., & Jordan, M. I. (1994). Reinforcement learning with soft state aggregation. In Advances in neural information processing systems (Vol. 7, pp. 361–368). Cambridge, MA: MIT Press.
Singh, S. P., Kearns, M., & Mansour, Y. (2000). Nash convergence of gradient dynamics in general-sum games. In Proceedings of the 16th conference on uncertainty in artificial intelligence (UAI’00) (pp. 541–548).
Sutton R. S., Barto A. G. (1998) Reinforcement learning: An introduction. Adaptive computation and machine learning series (3rd ed.). MIT Press, Cambridge, MA
Szepesvári, C. (1997). The asymptotic convergence rates for Q-learning. In Advances in neural information processing systems (Vol. 10, pp. 1064–1070). Cambridge, MA: MIT Press.
Szepesvári C., Littman M. L. (1999) A unified analysis of value-function-based reinforcement learning algorithms. Neural Computation 11(8): 2017–2059
Szepesvári, C., & Smart, W. D. (2004). Interpolation-based Q-learning. In Proceedings of the 21st international conference on machine learning (ICML’04) (pp. 100–107). New York, USA: ACM Press, July.
Tesauro G. (1994) TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation 6(2): 215–219
Tesauro G. (1995) Temporal difference learning and TD-Gammon. Communications of the ACM 38(3): 58–68
Tong H., Brown T. X. (2000) Reinforcement learning for call admission control and routing under quality of service constraints in multimedia networks. Machine Learning 49(2–3): 111–139
Tsitsiklis J. N., Athans M. (1985) On the complexity of decentralized decision making and detection problems. IEEE Transactions on Automatic Control AC-30(5): 440–446
Tsitsiklis J. N., Van Roy B. (1996) Feature-based methods for large scale dynamic programming. Machine Learning 22: 59–94
Tsitsiklis J. N., Van Roy B. (1997) An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42(5): 674–690
Uther, W., & Veloso, M. (2003). Adversarial reinforcement learning. Technical Report CMU-CS-03-107, School of Computer Science, Carnegie Mellon University, January.
Wang X., Sandholm T. (2003) Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In: Becker S., Thrun S., Obermayer K. (eds) Advances in neural information processing systems. MIT Press, Cambridge, MA, pp 1571–1578
Watkins, C. J. C. H. (1989). Learning from delayed rewards. PhD thesis, King’s College, University of Cambridge, May.
Young H. P. (1993) The evolution of conventions. Econometrica 61(1): 57–84
Cite this article
Melo, F.S., Ribeiro, M.I. Coordinated learning in multiagent MDPs with infinite state-space. Auton Agent Multi-Agent Syst 21, 321–367 (2010). https://doi.org/10.1007/s10458-009-9104-y