
Extending Sliding-Step Importance Weighting from Supervised Learning to Reinforcement Learning

  • Conference paper
Artificial Intelligence. IJCAI 2019 International Workshops (IJCAI 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12158)


Abstract

Stochastic gradient descent (SGD) has been at the center of many advances in modern machine learning. SGD processes examples sequentially, updating a weight vector in the direction that would most reduce the loss for that example. In many applications, some examples are more important than others and, to capture this, each example is given a non-negative weight that modulates its impact. Unfortunately, if the importance weights are highly variable they can greatly exacerbate the difficulty of setting the step-size parameter of SGD. To ease this difficulty, Karampatziakis and Langford [6] developed a class of elegant algorithms that are much more robust in the face of highly variable importance weights in supervised learning. In this paper we extend their idea, which we call “sliding step”, to reinforcement learning, where importance weights can be particularly variable due to the importance sampling involved in off-policy learning algorithms. We compare two alternative ways of doing the extension in the linear function approximation setting, then introduce specific sliding-step versions of the TD(0) and Emphatic TD(0) learning algorithms. We prove the convergence of our algorithms and demonstrate their effectiveness on both on-policy and off-policy problems. Overall, our new algorithms appear to be effective in bringing the robustness of the sliding-step technique from supervised learning to reinforcement learning.

2nd Scaling-Up Reinforcement Learning (SURL) Workshop, IJCAI 2019.
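To make the sliding-step idea concrete, here is a minimal sketch (not taken from the paper) of the importance-aware update of Karampatziakis and Langford [6] for squared loss with linear predictions, contrasted with the ordinary importance-weighted SGD step. The closed form arises from integrating infinitesimal SGD steps whose importance weights sum to h. The TD(0)-style variant at the end is only one plausible way such an extension could look, with the bootstrapped target held fixed and the importance-sampling ratio playing the role of the weight; the algorithms actually analyzed in the paper may differ, and all names below are illustrative.

```python
import numpy as np

def sgd_step(w, x, y, h, alpha):
    """Ordinary importance-weighted SGD on squared loss 0.5*(y - w.x)^2.
    The importance weight h simply scales the gradient, so a large h can
    push the prediction far past the target y."""
    return w + alpha * h * (y - w @ x) * x

def sliding_step(w, x, y, h, alpha):
    """Importance-aware ('sliding-step') update for squared loss [6]:
    the closed-form limit of splitting the weight h into many tiny SGD
    steps. The prediction moves toward y but never past it."""
    xx = x @ x
    if xx == 0.0:
        return w
    gain = (1.0 - np.exp(-alpha * h * xx)) / xx
    return w + gain * (y - w @ x) * x

def sliding_step_td0(w, x, r, x_next, rho, alpha, gamma=0.99):
    """Hedged sketch of one way to carry the idea to off-policy TD(0)
    with linear function approximation: treat the bootstrapped target
    z = r + gamma * w.x_next as fixed and use the importance-sampling
    ratio rho as the weight. Illustrative only, not the paper's exact
    algorithm."""
    z = r + gamma * (w @ x_next)
    return sliding_step(w, x, z, rho, alpha)

# Toy comparison with a highly variable importance weight (h = 1000):
rng = np.random.default_rng(0)
w = np.zeros(5)
x = rng.normal(size=5)
print(sgd_step(w, x, y=1.0, h=1000.0, alpha=0.1) @ x)      # far past the target
print(sliding_step(w, x, y=1.0, h=1000.0, alpha=0.1) @ x)  # close to, never past, 1.0
```

In this sketch the ordinary step drives the prediction well beyond the target when h is large, whereas the sliding-step prediction saturates at the target no matter how large alpha * h becomes; this robustness to variable weights is the property the paper carries over to TD-style updates.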


References

  1. Beygelzimer, A., Dasgupta, S., Langford, J.: Importance weighted active learning. In: Proceedings of the 26th International Conference on Machine Learning (ICML), pp. 49–56 (2009)

  2. Bradtke, S.J., Barto, A.G.: Linear least-squares algorithms for temporal difference learning. Mach. Learn. 22, 33–57 (1996)

  3. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55(1), 119–139 (1997)

  4. Ghiassian, S., Patterson, A., White, M., Sutton, R.S., White, A.: Online off-policy prediction. arXiv:1811.02597 (2018)

  5. Huang, J., Smola, A.J., Gretton, A., Borgwardt, K.M., Schölkopf, B.: Correcting sample selection bias by unlabeled data. Adv. Neural Inf. Process. Syst. 19, 601–608 (2006)

  6. Karampatziakis, N., Langford, J.: Online importance weight aware updates. In: Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 392–399 (2011)

  7. Precup, D., Sutton, R.S., Dasgupta, S.: Off-policy temporal-difference learning with function approximation. In: Proceedings of the 18th International Conference on Machine Learning (ICML), pp. 417–424 (2001)

  8. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)

  9. Sutton, R.S.: Learning to predict by the methods of temporal differences. Mach. Learn. 3(1), 9–44 (1988)

  10. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)

  11. Sutton, R.S., et al.: Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: Proceedings of the 26th International Conference on Machine Learning (ICML), pp. 993–1000 (2009)

  12. Sutton, R.S., Maei, H.R., Szepesvári, C.: A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In: Advances in Neural Information Processing Systems 21 (NIPS), pp. 1609–1616 (2008)

  13. Sutton, R.S., Mahmood, A.R., White, M.: An emphatic approach to the problem of off-policy temporal-difference learning. J. Mach. Learn. Res. 17(73), 1–29 (2016)

  14. Tsitsiklis, J.N.: Asynchronous stochastic approximation and Q-learning. Mach. Learn. 16(3), 185–202 (1994)

  15. Tsitsiklis, J.N., Van Roy, B.: An analysis of temporal-difference learning with function approximation. IEEE Trans. Autom. Control 42(5), 674–690 (1997)


Author information


Corresponding author: Tian Tian.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Tian, T., Sutton, R.S. (2020). Extending Sliding-Step Importance Weighting from Supervised Learning to Reinforcement Learning. In: El Fallah Seghrouchni, A., Sarne, D. (eds) Artificial Intelligence. IJCAI 2019 International Workshops. IJCAI 2019. Lecture Notes in Computer Science, vol. 12158. Springer, Cham. https://doi.org/10.1007/978-3-030-56150-5_4


  • DOI: https://doi.org/10.1007/978-3-030-56150-5_4


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-56149-9

  • Online ISBN: 978-3-030-56150-5

  • eBook Packages: Computer Science, Computer Science (R0)
