Models of Human Behavioral Agents in Bandits, Contextual Bandits and RL

  • Conference paper
Human Brain and Artificial Intelligence (HBAI 2021)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1369)

Abstract

Artificial behavioral agents are often evaluated on the consistency of their behavior and their performance when taking sequential actions in an environment to maximize some notion of cumulative reward. However, human decision making in real life usually involves different strategies and behavioral trajectories that lead to the same empirical outcome. Motivated by the clinical literature on a wide range of neurological and psychiatric disorders, we propose here a more general and flexible parametric framework for sequential decision making that involves a two-stream reward processing mechanism. We demonstrate that this framework is flexible and unified enough to incorporate a family of problems spanning multi-armed bandits (MAB), contextual bandits (CB) and reinforcement learning (RL), which decompose the sequential decision making process at different levels. Inspired by the known reward processing abnormalities of many mental disorders, our clinically inspired agents demonstrate distinctive behavioral trajectories and comparable performance on simulated tasks with particular reward distributions, on a real-world dataset capturing human decision making in gambling tasks, and on the PacMan game across different reward stationarities in a lifelong learning setting. (The code to reproduce all experimental results is available at https://github.com/doerlbh/mentalRL.)
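For readers who want a concrete picture of the two-stream mechanism before reading the paper, the sketch below shows one way such a split reward-processing agent can be written as tabular Q-learning with separate positive and negative value streams, in the spirit of Split Q-learning [29, 32]. The parameter names (lambda_pos, w_pos, lambda_neg, w_neg) and default values are assumptions of this sketch, not the authors' exact implementation, which is available in the linked repository.

```python
import numpy as np

# Minimal sketch of a two-stream ("split") tabular Q-learning agent.
# Positive and negative rewards are routed to separate value streams, each
# with its own memory weight (lambda_*) and reward-scaling weight (w_*).
# All names and defaults here are illustrative assumptions.
class SplitQAgent:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95,
                 lambda_pos=1.0, w_pos=1.0, lambda_neg=1.0, w_neg=1.0,
                 epsilon=0.1, seed=0):
        self.q_pos = np.zeros((n_states, n_actions))  # stream for rewards
        self.q_neg = np.zeros((n_states, n_actions))  # stream for punishments
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.lambda_pos, self.w_pos = lambda_pos, w_pos
        self.lambda_neg, self.w_neg = lambda_neg, w_neg
        self.rng = np.random.default_rng(seed)

    def act(self, state):
        # Epsilon-greedy over the combined value of the two streams.
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.q_pos.shape[1]))
        return int(np.argmax(self.q_pos[state] + self.q_neg[state]))

    def update(self, state, action, reward, next_state):
        # Split the scalar reward into its positive and negative components.
        r_pos, r_neg = max(reward, 0.0), min(reward, 0.0)
        nxt = int(np.argmax(self.q_pos[next_state] + self.q_neg[next_state]))
        # Each stream performs its own TD-style update; biasing w_pos/w_neg
        # (or lambda_pos/lambda_neg) skews learning toward gains or losses.
        td_pos = (self.w_pos * r_pos + self.gamma * self.q_pos[next_state, nxt]
                  - self.q_pos[state, action])
        td_neg = (self.w_neg * r_neg + self.gamma * self.q_neg[next_state, nxt]
                  - self.q_neg[state, action])
        self.q_pos[state, action] = (self.lambda_pos * self.q_pos[state, action]
                                     + self.alpha * td_pos)
        self.q_neg[state, action] = (self.lambda_neg * self.q_neg[state, action]
                                     + self.alpha * td_neg)
```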


Notes

  1. http://ai.berkeley.edu/project_overview.html.

  2. https://github.com/doerlbh/mentalRL/tree/master/video.

References

  1. Agrawal, S., Goyal, N.: Analysis of Thompson Sampling for the multi-armed bandit problem. In: COLT 2012 - The 25th Annual Conference on Learning Theory, Edinburgh, Scotland, 25–27 June 2012, pp. 39.1–39.26 (2012). http://www.jmlr.org/proceedings/papers/v23/agrawal12/agrawal12.pdf

  2. Agrawal, S., Goyal, N.: Thompson sampling for contextual bandits with linear payoffs. In: ICML, no. 3, pp. 127–135 (2013)

  3. Auer, P., Cesa-Bianchi, N.: On-line learning with malicious noise and the closure algorithm. Ann. Math. Artif. Intell. 23(1–2), 83–99 (1998)

  4. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47(2–3), 235–256 (2002)

  5. Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1), 48–77 (2002)

  6. Bayer, H.M., Glimcher, P.W.: Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron 47(1), 129–141 (2005). https://doi.org/10.1016/j.neuron.2005.05.020

  7. Bechara, A., Damasio, A.R., Damasio, H., Anderson, S.W.: Insensitivity to future consequences following damage to human prefrontal cortex. Cognition 50(1–3), 7–15 (1994)

  8. Beygelzimer, A., Langford, J., Li, L., Reyzin, L., Schapire, R.: Contextual bandit algorithms with supervised learning guarantees. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 19–26 (2011)

  9. Bouneffouf, D., Féraud, R.: Multi-armed bandit problem with known trend. Neurocomputing 205, 16–21 (2016). https://doi.org/10.1016/j.neucom.2016.02.052

  10. Bouneffouf, D., Rish, I., Cecchi, G.A.: Bandit models of human behavior: reward processing in mental disorders. In: Everitt, T., Goertzel, B., Potapov, A. (eds.) AGI 2017. LNCS (LNAI), vol. 10414, pp. 237–248. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-63703-7_22

  11. Bouneffouf, D., Rish, I., Cecchi, G.A., Féraud, R.: Context attentive bandits: contextual bandit with restricted context. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 1468–1475 (2017)

  12. Chapelle, O., Li, L.: An empirical evaluation of Thompson sampling. In: Advances in Neural Information Processing Systems, pp. 2249–2257 (2011)

  13. Dayan, P., Niv, Y.: Reinforcement learning: the good, the bad and the ugly. Curr. Opin. Neurobiol. 18(2), 185–196 (2008)

  14. Elfwing, S., Seymour, B.: Parallel reward and punishment control in humans and robots: Safe reinforcement learning using the MaxPain algorithm. In: 2017 Joint IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), pp. 140–147. IEEE (2017)

  15. Even-Dar, E., Mansour, Y.: Learning rates for q-learning. J. Mach. Learn. Res. 5, 1–25 (2003)

  16. Frank, M.J., O’Reilly, R.C.: A mechanistic account of striatal dopamine function in human cognition: psychopharmacological studies with cabergoline and haloperidol. Behav. Neurosci. 120(3), 497–517 (2006). https://doi.org/10.1037/0735-7044.120.3.497

  17. Frank, M.J., Seeberger, L.C., O’Reilly, R.C.: By carrot or by stick: cognitive reinforcement learning in parkinsonism. Science 306(5703), 1940–1943 (2004)

  18. Fridberg, D.J., et al.: Cognitive mechanisms underlying risky decision-making in chronic cannabis users. J. Math. Psychol. 54(1), 28–38 (2010)

  19. Hart, A.S., Rutledge, R.B., Glimcher, P.W., Phillips, P.E.M.: Phasic dopamine release in the rat nucleus accumbens symmetrically encodes a reward prediction error term. J. Neurosci. 34(3), 698–704 (2014). https://doi.org/10.1523/JNEUROSCI.2489-13.2014

  20. Hasselt, H.V.: Double q-learning. In: Advances in Neural Information Processing Systems, pp. 2613–2621 (2010)

  21. Holmes, A.J., Patrick, L.M.: The myth of optimality in clinical neuroscience. Trends Cogn. Sci. 22(3), 241–257 (2018). https://doi.org/10.1016/j.tics.2017.12.006

  22. Horstmann, A., Villringer, A., Neumann, J.: Iowa gambling task: there is more to consider than long-term outcome. Using a linear equation model to disentangle the impact of outcome and frequency of gains and losses. Front. Neurosci. 6, 61 (2012)

  23. Lai, T.L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1), 4–22 (1985)

  24. Langford, J., Zhang, T.: The Epoch-Greedy algorithm for contextual multi-armed bandits (2007)

  25. Langford, J., Zhang, T.: The Epoch-Greedy algorithm for multi-armed bandits with side information. In: Advances in Neural Information Processing Systems, pp. 817–824 (2008)

  26. Li, L., Chu, W., Langford, J., Wang, X.: Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In: King, I., Nejdl, W., Li, H. (eds.) WSDM, pp. 297–306. ACM (2011). http://dblp.uni-trier.de/db/conf/wsdm/wsdm2011.html#LiCLW11

  27. Lin, B.: Diabolical games: reinforcement learning environments for lifelong learning (2020)

  28. Lin, B.: Online semi-supervised learning in contextual bandits with episodic reward. arXiv preprint arXiv:2009.08457 (2020)

  29. Lin, B., Bouneffouf, D., Cecchi, G.: Split q learning: reinforcement learning with two-stream rewards. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 6448–6449. AAAI Press (2019)

  30. Lin, B., Bouneffouf, D., Cecchi, G.: Online learning in iterated prisoner’s dilemma to mimic human behavior. arXiv preprint arXiv:2006.06580 (2020)

  31. Lin, B., Bouneffouf, D., Cecchi, G.A., Rish, I.: Contextual bandit with adaptive feature extraction. In: 2018 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 937–944. IEEE (2018)

  32. Lin, B., Bouneffouf, D., Reinen, J., Rish, I., Cecchi, G.: A story of two streams: reinforcement learning models from human behavior and neuropsychiatry. In: Proceedings of the Nineteenth International Conference on Autonomous Agents and Multi-Agent Systems, AAMAS 2020, pp. 744–752. International Foundation for Autonomous Agents and Multiagent Systems, May 2020

  33. Lin, B., Zhang, X.: Speaker diarization as a fully online learning problem in MiniVox. arXiv preprint arXiv:2006.04376 (2020)

  34. Lin, B., Zhang, X.: VoiceID on the fly: a speaker recognition system that learns from scratch. In: INTERSPEECH (2020)

  35. Maia, T.V., Frank, M.J.: From reinforcement learning models to psychiatric and neurological disorders. Nat. Neurosci. 14(2), 154–162 (2011). https://doi.org/10.1038/nn.2723

  36. O’Doherty, J., Dayan, P., Schultz, J., Deichmann, R., Friston, K., Dolan, R.J.: Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science 304, 452–454 (2004). https://doi.org/10.1126/science.1094285

  37. Perry, D.C., Kramer, J.H.: Reward processing in neurodegenerative disease. Neurocase 21(1), 120–133 (2015)

  38. Rummery, G.A., Niranjan, M.: On-line Q-learning using connectionist systems, vol. 37. Department of Engineering, University of Cambridge, Cambridge, England (1994)

  39. Schultz, W., Dayan, P., Montague, P.R.: A neural substrate of prediction and reward. Science 275(5306), 1593–1599 (1997). https://doi.org/10.1126/science.275.5306.1593

  40. Seymour, B., Singer, T., Dolan, R.: The neurobiology of punishment. Nat. Rev. Neurosci. 8(4), 300–311 (2007). https://doi.org/10.1038/nrn2119

  41. Steingroever, H., et al.: Data from 617 healthy participants performing the Iowa gambling task: a “Many Labs” collaboration. J. Open Psychol. Data 3(1), 340–353 (2015)

  42. Sutton, R.S., Barto, A.G.: Introduction to Reinforcement Learning, 1st edn. MIT Press, Cambridge (1998)

  43. Sutton, R.S., Barto, A.G., et al.: Introduction to Reinforcement Learning, vol. 135. MIT Press, Cambridge (1998)

  44. Thompson, W.: On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 285–294 (1933)

  45. Tversky, A., Kahneman, D.: The framing of decisions and the psychology of choice. Science 211(4481), 453–458 (1981)

Author information

Corresponding author

Correspondence to Baihan Lin.

A Further Motivation from Neuroscience

In this appendix, we provide further discussion and a literature review of the neuroscience and clinical studies related to reward processing systems.

Cellular Computation of Reward and Reward Violation. Decades of evidence have linked dopamine function to reinforcement learning via neurons in the midbrain and their connections to the basal ganglia, limbic regions, and cortex. The firing rates of dopamine neurons computationally represent reward magnitude, expectancy, violations (prediction errors), and other value-based signals [39]. This allows an animal to update and maintain value expectations associated with particular states and actions. When functioning properly, this system helps an animal develop a policy that maximizes outcomes by approaching or choosing cues with higher expected value and avoiding cues associated with loss or punishment. The mechanism is conceptually similar to the reinforcement learning widely used in computing and robotics [43], suggesting a mechanistic overlap between humans and AI. Evidence of Q-learning and actor-critic models has been observed in the spiking activity of midbrain dopamine neurons in primates [6] and in the human striatum using the BOLD signal [36].
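As a concrete illustration of the prediction-error computation described above, the snippet below performs a textbook temporal-difference update in the spirit of [39, 43]; the state names and learning-rate values are purely illustrative and are not taken from the paper.

```python
# Textbook temporal-difference (TD) learning step: the prediction error
# delta plays the computational role attributed to phasic dopamine firing.
def td_update(values, state, next_state, reward, alpha=0.1, gamma=0.95):
    delta = reward + gamma * values[next_state] - values[state]  # prediction error
    values[state] += alpha * delta  # move the value expectation toward the outcome
    return delta

# A better-than-expected reward produces a positive prediction error.
values = {"cue": 0.0, "outcome": 0.0}
print(td_update(values, "cue", "outcome", reward=1.0))  # prints 1.0
```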

Positive vs. Negative Learning Signals. Phasic dopamine signaling represents bidirectional (positive and negative) coding of prediction error signals [19], but the underlying mechanisms differ for reward relative to punishment learning [40]. Though the representation of cellular-level aversive error signaling has been debated [13], it is widely thought that rewarding, salient information is represented by phasic dopamine signals, whereas reward omission or punishment signals are represented by dips or pauses in baseline dopamine firing [39]. These mechanisms have downstream effects on motivation, approach behavior, and action selection. Reward signaling in the direct pathway links the striatum to the cortex via dopamine neurons that disinhibit the thalamus through the internal segment of the globus pallidus, facilitating action and approach behavior. Aversive signals may have the opposite effect in the indirect pathway, in which D2-mediated neurons inhibit thalamic function and, ultimately, action [16]. Manipulating these circuits, whether pharmacologically or through disease, has demonstrated computationally predictable effects that bias learning from positive or negative prediction errors in humans [17], and contributes to our understanding of perceptible differences in human decision making when differentially motivated by loss or gain [45].
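A common way to capture this asymmetry in simple models, loosely analogous to the direct/indirect pathway biases studied in [16, 17], is to use different learning rates for positive and negative prediction errors. The function below is a generic sketch with illustrative parameter values, not the authors' model.

```python
# Sketch of asymmetric value updating for a single cue or action:
# better-than-expected outcomes are learned with alpha_gain,
# worse-than-expected outcomes with alpha_loss (0.3 and 0.1 are arbitrary).
def asymmetric_update(value, reward, alpha_gain=0.3, alpha_loss=0.1):
    delta = reward - value  # prediction error
    alpha = alpha_gain if delta >= 0 else alpha_loss
    return value + alpha * delta

# alpha_gain > alpha_loss biases learning toward rewards;
# reversing the inequality biases it toward punishments.
```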

Clinical Implications. Highlighting the importance of using computational models to understand and predict disease outcomes, many symptoms of neurological and psychiatric disease are related to biases in learning from positive and negative feedback [35]. Studies in humans have shown that when reward signaling in the direct pathway is over-expressed, the value associated with a state may be enhanced, driving pathological reward-seeking behavior such as gambling or substance use. Conversely, when aversive error signals are enhanced, reward experience is dampened and motor inhibition increases, causing symptoms that decrease motivation, such as apathy, social withdrawal, fatigue, and depression. Further, it has been proposed that exposure to a particular distribution of experiences during critical periods of development can biologically predispose an individual to learn preferentially from positive or negative outcomes, making them more or less susceptible to brain-based illnesses [21]. These points highlight the need for a deeper understanding of how intelligent systems differentially learn from rewards and punishments, and how experience sampling may impact reinforcement learning during formative training periods.
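To make the over- and under-weighting described here concrete, the toy calculation below scores a risky option under different gain and loss weights. The numbers and the function are stand-in illustrations for this discussion, not an analysis or parameterization from the paper.

```python
# Toy subjective value of a gamble under separate gain and loss weights.
# Over-weighting gains makes a negative-expected-value gamble look attractive
# (crudely mimicking reward-seeking); over-weighting losses strongly deters it.
def subjective_value(p_gain, gain, loss, w_gain=1.0, w_loss=1.0):
    return p_gain * w_gain * gain + (1 - p_gain) * w_loss * loss

risky = dict(p_gain=0.5, gain=10.0, loss=-12.0)
print(subjective_value(**risky))                          # -1.0: unbiased expected value
print(subjective_value(**risky, w_gain=2.0, w_loss=0.5))  #  7.0: reward-hypersensitive
print(subjective_value(**risky, w_gain=0.5, w_loss=2.0))  # -9.5: punishment-hypersensitive
```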


Copyright information

© 2021 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Lin, B., Cecchi, G., Bouneffouf, D., Reinen, J., Rish, I. (2021). Models of Human Behavioral Agents in Bandits, Contextual Bandits and RL. In: Wang, Y. (eds) Human Brain and Artificial Intelligence. HBAI 2021. Communications in Computer and Information Science, vol 1369. Springer, Singapore. https://doi.org/10.1007/978-981-16-1288-6_2

Download citation

  • DOI: https://doi.org/10.1007/978-981-16-1288-6_2

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-1287-9

  • Online ISBN: 978-981-16-1288-6

  • eBook Packages: Computer Science, Computer Science (R0)
