
A Policy Iteration Algorithm for Learning from Preference-Based Feedback

  • Conference paper
Advances in Intelligent Data Analysis XII (IDA 2013)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8207)


Abstract

Conventional approaches to reinforcement learning assume the availability of a numerical feedback signal, but in many domains such a signal is difficult to define or not available at all. The recently proposed framework of preference-based reinforcement learning relaxes this condition by replacing the quantitative reward signal with qualitative preferences over trajectories. In this paper, we show how to estimate preferences over actions from preferences over trajectories. These action preferences can then be used to learn a preferred policy. The performance of this new approach is evaluated by a comparison with SARSA in three common reinforcement learning benchmark problems, namely mountain car, inverted pendulum, and acrobot. The results show comparable convergence rates, achieved with considerably less effort spent on tuning the setup.
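
The idea sketched in the abstract can be illustrated informally: preferences over whole trajectories are broken down into pairwise preferences between the actions taken in the same state, and a policy then favours, in each state, the action that wins the most comparisons. The Python sketch below shows this in a tiny tabular setting; the function names, the simple counting scheme, and the greedy selection are illustrative assumptions and are not the policy iteration algorithm or the probability estimates used in the paper.

    # Illustrative sketch (not the authors' implementation): deriving
    # state-action preferences from trajectory preferences in a small
    # tabular setting, then acting greedily on the estimated preferences.
    from collections import defaultdict

    def update_action_preferences(pref_counts, preferred, dominated):
        """Credit each state-action pair of the preferred trajectory over the
        action taken in the same state on the dominated trajectory (if any)."""
        dominated_actions = {s: a for (s, a) in dominated}
        for (s, a_good) in preferred:
            a_bad = dominated_actions.get(s)
            if a_bad is not None and a_bad != a_good:
                # pref_counts[s][(a_good, a_bad)] counts "a_good preferred to a_bad in s"
                pref_counts[s][(a_good, a_bad)] += 1

    def greedy_policy(pref_counts, actions):
        """Pick, per visited state, the action that wins the most pairwise comparisons."""
        policy = {}
        for s, counts in pref_counts.items():
            wins = defaultdict(int)
            for (a_good, a_bad), c in counts.items():
                wins[a_good] += c
            policy[s] = max(actions, key=lambda a: wins[a])
        return policy

    # Tiny usage example with hand-made trajectories of (state, action) pairs.
    if __name__ == "__main__":
        actions = ["left", "right"]
        pref_counts = defaultdict(lambda: defaultdict(int))
        preferred = [("s0", "right"), ("s1", "right")]
        dominated = [("s0", "left"),  ("s1", "right")]
        update_action_preferences(pref_counts, preferred, dominated)
        print(greedy_policy(pref_counts, actions))  # {'s0': 'right'}

In this toy setting a single trajectory comparison already yields an action preference in state s0; the paper aggregates many such comparisons and uses them within a policy iteration loop.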





Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wirth, C., Fürnkranz, J. (2013). A Policy Iteration Algorithm for Learning from Preference-Based Feedback. In: Tucker, A., Höppner, F., Siebes, A., Swift, S. (eds) Advances in Intelligent Data Analysis XII. IDA 2013. Lecture Notes in Computer Science, vol 8207. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41398-8_37


  • DOI: https://doi.org/10.1007/978-3-642-41398-8_37

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-41397-1

  • Online ISBN: 978-3-642-41398-8

  • eBook Packages: Computer Science, Computer Science (R0)
