Which Temporal Difference learning algorithm best reproduces dopamine activity in a multi-choice task?
© Bellot et al; licensee BioMed Central Ltd. 2013
Published: 8 July 2013
The activity of dopaminergic (DA) neurons has been hypothesized to encode a reward prediction error (RPE)  which corresponds to the error signal in Temporal Difference (TD) learning algorithms . This hypothesis has been reinforced by numerous studies showing the relevance of TD learning algorithms to describe the role of basal ganglia in classical conditioning. However, most studies have recorded DA activity during pavlovian conditioning, and thus the exact nature of the signal encoded by DA neurons during a choice remains unclear. In the literature of reinforcement learning different TD learning algorithms predict different RPE during a choice. If the algorithm SARSA predicts a RPE based on the future choice, Q-learning predicts a RPE that will be based on the action that maximize the future amount of reward and V-learning predicts a RPE based on an average of the values of the different available options.
Recent recordings of DA neurons during multi-choice tasks investigated this issue and raised contradictory interpretations on whether DA's RPE signal is action dependent  or not . While the first study suggests that DA neurons encode a RPE compatible with SARSA, results from the second study are interpreted as more consistent with Q-learning . However these studies only proposed a qualitative comparison of the ability of these TD learning algorithms to explain these patterns of activity. In this work we simulated and precisely analyzed these algorithms in relation with previous electrophysiological recordings in a multi-choice task performed by rats . We found that, when fitting the behavior, the simulated algorithms predict a fast convergence of the RPE, incompatible with the observed DA activity, suggesting an apparent dissociation between the signal encoded by dopamine neurons and behavioral adaption of the animals. Further analyses of the evolution of dopamine neurons activity across learning indicated that, complementarily to the RPE, the value function fits well with the activity. However the value function cannot explain the inhibition of DA activity during omission and the global decrease of DA activity during a session at the time of reward delivery. Thus in this task, information about both RPE and value may be conveyed by dopamine activity.
By quantitatively comparing the ability of the different TD learning algorithms, this work shows the limitation of these algorithms to fit both the behavior and the DA activity observed in a multi-choice task, when interpreting DA activity as a RPE only. Unexpectedly we show that a value function better fits DA activity suggesting that DA neurons recorded in this task may encode multiple information.
- Schultz W, Dayan P, Montague PR: A neural substrate of prediction and reward. Science. 1997, 275 (5306): 1593-1599. 10.1126/science.275.5306.1593.View ArticlePubMedGoogle Scholar
- Richard S, Sutton Andrew G, Barto : Introduction to Reinforcement Learning. MIT Press. 1998, 1,Google Scholar
- Morris G, Nevet A, Arkadir D, Vaadia E, Bergman H: Midbrain dopamine neurons encode decisions for future action. Nature neuroscience. 2006, 9 (8): 1057-1063. 10.1038/nn1743.View ArticlePubMedGoogle Scholar
- Roesch MR, Calu DJ, Schoenbaum G: Dopamine neurons encode the better option in rats deciding between differently delayed or sized rewards. Nature Neuroscience. 2007, 10 (12): 1615-1624. 10.1038/nn2013.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.