Abstract
This article proposes a three-timescale, simulation-based algorithm for solving infinite-horizon Markov Decision Processes (MDPs). We assume a finite state space and a discounted-cost criterion, and adopt the value iteration approach. An approximation of the Dynamic Programming operator T is applied to the value function iterates. This ‘approximate’ operator is implemented using three timescales, the slowest of which updates the value function iterates. On the middle timescale, we perform a gradient search over the feasible action set of each state using Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimates, thus finding the minimizing action in T. On the fastest timescale, we obtain the ‘critic’ estimates over which the gradient search is performed. A sketch of convergence, explaining the dynamics of the algorithm through its associated ODEs, is also presented. Numerical experiments on rate-based flow control at a bottleneck node, using a continuous-time queueing model, are performed with the proposed algorithm. The results are verified against classical value iteration with the feasible set suitably discretized. Over such a discretized setting, a variant of the algorithm of missing data is compared and the proposed algorithm is found to converge faster.
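To make the three-timescale structure concrete, the following is a minimal Python sketch on a toy problem. It is an illustration under stated assumptions, not the authors' algorithm: the three-state MDP, the quadratic cost, the step-size exponents, the perturbation size `DELTA`, and all function names are hypothetical choices made here. The fastest recursion averages simulated one-step costs and SPSA gradient estimates (the "critic"), the middle recursion moves each state's action along the negative gradient estimate and projects it back onto the feasible box, and the slowest recursion updates the value iterate. A classical value iteration over a discretized action grid is included as a reference.

```python
import numpy as np

# Toy problem (all values are illustrative assumptions, not from the paper).
N_STATES, GAMMA, DELTA = 3, 0.5, 0.1
BASE = np.array([0.0, 0.5, 1.0])                          # per-state running cost
TARGET = np.array([[0.5, 0.3], [0.5, 0.6], [0.5, 0.8]])   # cheapest action per state

def cost(s, a):
    """Single-stage cost: quadratic penalty around a state-dependent target."""
    return BASE[s] + np.sum((a - TARGET[s]) ** 2)

def simulate(s, a, rng):
    """One simulated transition: move to the next state w.p. a[0], else stay."""
    nxt = (s + 1) % N_STATES if rng.random() < a[0] else s
    return cost(s, a), nxt

def three_timescale_vi(n_iter=20000, seed=0):
    rng = np.random.default_rng(seed)
    V = np.zeros(N_STATES)               # value iterates (slowest timescale)
    act = np.full((N_STATES, 2), 0.5)    # actions in [0, 1]^2 (middle timescale)
    q_hat = np.zeros(N_STATES)           # critic: cost-to-go estimate (fastest)
    g_hat = np.zeros((N_STATES, 2))      # critic: SPSA gradient estimate (fastest)
    for n in range(1, n_iter + 1):
        # Step sizes chosen so that c_n / b_n -> 0 and b_n / a_n -> 0.
        a_n, b_n, c_n = n ** -0.55, n ** -0.7, n ** -0.85
        for s in range(N_STATES):
            d = rng.choice([-1.0, 1.0], size=2)           # Bernoulli +/-1 perturbation
            a_plus = np.clip(act[s] + DELTA * d, 0.0, 1.0)
            a_minus = np.clip(act[s] - DELTA * d, 0.0, 1.0)
            k_p, s_p = simulate(s, a_plus, rng)
            k_m, s_m = simulate(s, a_minus, rng)
            q_p = k_p + GAMMA * V[s_p]
            q_m = k_m + GAMMA * V[s_m]
            # Fastest: average the simulated quantities.
            q_hat[s] += a_n * (0.5 * (q_p + q_m) - q_hat[s])
            g_hat[s] += a_n * ((q_p - q_m) / (2.0 * DELTA * d) - g_hat[s])
            # Middle: projected gradient step toward the minimizing action.
            act[s] = np.clip(act[s] - b_n * g_hat[s], 0.0, 1.0)
            # Slowest: move the value iterate toward the critic estimate.
            V[s] += c_n * (q_hat[s] - V[s])
    return V, act

def exact_vi(n_grid=101, n_sweeps=100):
    """Classical value iteration with the action box discretized on a grid."""
    g = np.linspace(0.0, 1.0, n_grid)
    a0, a1 = np.meshgrid(g, g, indexing="ij")
    V = np.zeros(N_STATES)
    for _ in range(n_sweeps):
        V = np.array([
            np.min(BASE[s] + (a0 - TARGET[s, 0]) ** 2 + (a1 - TARGET[s, 1]) ** 2
                   + GAMMA * (a0 * V[(s + 1) % N_STATES] + (1.0 - a0) * V[s]))
            for s in range(N_STATES)
        ])
    return V

if __name__ == "__main__":
    V_sim, actions = three_timescale_vi()
    print("simulated V:", V_sim)
    print("grid-based V:", exact_vi())
```

In this sketch the gradient estimate itself is averaged on the fastest timescale, so a fresh perturbation can be drawn at every step. This is one of several possible arrangements and may differ from the scheme in the paper.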
References
Abdulla, M.S., and Bhatnagar, S. Reinforcement Learning Based Algorithms for Average Cost Markov Decision Processes. Submitted.
Bertsekas, D.P. Dynamic Programming and Stochastic Control. New York: Academic Press, 1976.
Bhatnagar, S., and Abdulla, M.S. Reinforcement Learning Based Algorithms for Finite Horizon Markov Decision Processes. Submitted.
Bhatnagar, S., and Kumar, S. A Simultaneous Perturbation Stochastic Approximation-Based Actor-Critic Algorithm for Markov Decision Processes. IEEE Trans. on Automatic Control, 2004, 49(4): 592–598.
Bhatnagar, S., et al. Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences. ACM Trans. on Modeling and Computer Simulation, 2003, 13(4): 180–209.
Choi, D. S., and Van Roy, B. A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning. Submitted to Discrete Event Dynamic Systems.
De Farias, D.P., and Van Roy, B. On the Existence of Fixed Points for Approximate Value Iteration and Temporal-Difference Learning. Journal of Optimization Theory and Applications, June 2000, 105(3).
Konda, V.R., and Borkar, V.S. Actor-Critic Type Learning Algorithms for Markov Decision Processes. SIAM J. Control Optim., 1999, 38(1): 94–123.
Konda, V.R., and Tsitsiklis, J.N. Actor-Critic Algorithms. SIAM J. Control Optim., 2003, 42(4): 1143–1166.
Singh, S., and Bertsekas, D. Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems. Advances in Neural Information Processing Systems (NIPS), 1997, 9: 974–980.
Tsitsiklis, J. N., and Van Roy, B. Optimal Stopping of Markov Processes: Hilbert Space Theory, Approximation Algorithms, and an Application to Pricing High-Dimensional Financial Derivatives. IEEE Trans. on Automatic Control, 1999, 44(10): 1840–1851.
Van Roy, B., et al. A Neuro-Dynamic Programming Approach to Retailer Inventory Management. Proc. of the IEEE Conf. on Decision and Control, 1997.
Copyright information
© 2005 International Federation for Information Processing
About this paper
Cite this paper
Abdulla, M.S., Bhatnagar, S. (2005). Solution of MDPs Using Simulation-Based Value Iteration. In: Li, D., Wang, B. (eds) Artificial Intelligence Applications and Innovations. AIAI 2005. IFIP — The International Federation for Information Processing, vol 187. Springer, Boston, MA. https://doi.org/10.1007/0-387-29295-0_83
DOI: https://doi.org/10.1007/0-387-29295-0_83
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-28318-0
Online ISBN: 978-0-387-29295-3
eBook Packages: Computer Science (R0)