Stable reinforcement learning via temporal competition between LTP and LTD traces
© Huertas et al; licensee BioMed Central Ltd. 2014
Published: 21 July 2014
Neuronal systems that are involved in reinforcement learning must solve the temporal credit assignment problem, i.e., how is a stimulus associated with a reward that is delayed in time? Theoretical studies [1–3] have postulated that neural activity underlying learning ‘tags’ synapses with an ‘eligibility trace’, and that the subsequent arrival of a reward converts the eligibility traces into actual modification of synaptic efficacies. While eligibility traces provide one simple solution to the temporal credit assignment problem, they alone do not constitute a stable learning rule because there is no other mechanism indicating when learning should cease. In order to attain stability, rules involving eligibility traces often assume that once the association is learned, further learning is prevented via an inhibition of the reward stimulus [1, 3, 4].
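The stability problem described above can be illustrated with a toy simulation. This is a minimal sketch rather than the model of [1–4]: the single exponentially decaying trace, the update rule, the function name `run_trials`, and every parameter value are assumptions chosen only for illustration. With a fixed reward, the weight grows on every trial. When the effective reward is reduced by what has already been learned (a Rescorla-Wagner-style stand-in for reward inhibition), learning stops by itself.

```python
import math

def run_trials(n_trials, inhibit_reward, eta=0.5, tau=2.0, delay=1.0, reward=1.0):
    """Return the synaptic weight after each trial (hypothetical toy rule).

    A single eligibility trace is set to 1 at stimulus time and decays
    exponentially with time constant tau; reward arrives `delay` seconds later.
    """
    w = 0.0
    history = []
    trace_at_reward = math.exp(-delay / tau)  # trace value when reward arrives
    for _ in range(n_trials):
        # Assumed form of reward inhibition: reward minus the current weight.
        effective_reward = reward - w if inhibit_reward else reward
        w += eta * effective_reward * trace_at_reward
        history.append(w)
    return history

if __name__ == "__main__":
    unbounded = run_trials(50, inhibit_reward=False)
    bounded = run_trials(50, inhibit_reward=True)
    print(f"no inhibition:   w after 50 trials = {unbounded[-1]:.3f}")
    print(f"with inhibition: w after 50 trials = {bounded[-1]:.3f}")
```

Without inhibition the weight increases by the same amount on every trial, so nothing in the rule itself signals when to stop.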
Although synaptic plasticity is responsible for reinforcement learning in the brain, theories of reinforcement learning are generally abstract and involve neither neurons nor synapses. Furthermore, biophysical theories of synaptic plasticity typically model unsupervised learning and ignore the contribution of reinforcement. Here we describe a biophysically based theory of reinforcement-modulated synaptic plasticity and postulate the existence of two eligibility traces with different temporal profiles: one corresponding to the induction of LTP, and the other to the induction of LTD. The traces have different kinetics, and their difference in magnitude at the time of reward determines whether the synaptic modification corresponds to LTP or LTD. Because of the difference in their decay rates, the LTP and LTD traces can exhibit temporal competition at the reward time and thus provide a mechanism for stable reinforcement learning without the need to inhibit reward. We test this novel reinforcement-learning rule on an experimentally motivated model of a recurrent cortical network, and compare the model results to experimental results at both the cellular and circuit levels. We further suggest that these eligibility traces are implemented via kinases and phosphatases, thus accounting for results at both the cellular and system levels.
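The competition between the two traces can be sketched numerically. The abstract states only that the traces have different kinetics and that their difference at reward time sets the sign of plasticity. Everything else here is an assumption made for illustration: the exponential decay, which trace is faster, the amplitudes and time constants, and the names `trace`, `weight_change`, and `crossover_time`. The recurrent network model itself is not reproduced.

```python
import math

# All parameter values are illustrative assumptions, not taken from the paper.
A_LTP, TAU_LTP = 1.0, 1.0  # assumed: LTP trace starts larger and decays faster
A_LTD, TAU_LTD = 0.5, 3.0  # assumed: LTD trace starts smaller and decays slower

def trace(amplitude, tau, t):
    """Exponentially decaying eligibility trace, t seconds after induction."""
    return amplitude * math.exp(-t / tau)

def weight_change(t_reward, eta=0.1):
    """Signed synaptic change when reward arrives t_reward seconds after induction."""
    ltp = trace(A_LTP, TAU_LTP, t_reward)
    ltd = trace(A_LTD, TAU_LTD, t_reward)
    return eta * (ltp - ltd)  # positive: net LTP, negative: net LTD

def crossover_time():
    """Reward delay at which the two traces are equal, so the net change is zero."""
    return math.log(A_LTP / A_LTD) / (1.0 / TAU_LTP - 1.0 / TAU_LTD)

if __name__ == "__main__":
    for t in (0.5, crossover_time(), 2.0):
        print(f"reward at t={t:.2f} s -> dw={weight_change(t):+.4f}")
```

Under these assumptions an early reward yields net potentiation, a late reward yields net depression, and at the crossover delay the traces cancel. That balance point is the sense in which trace competition could halt learning without any change to the reward signal.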
- Sutton RS, Barto AG: Reinforcement Learning. 1990, Cambridge, MA: MIT Press.
- Izhikevich EM: Solving the distal reward problem through linkage of STDP and dopamine signaling. Cereb Cortex. 2007, 17 (10): 2443-2452. 10.1093/cercor/bhl152.
- Gavornik JP, Shuler MG, Loewenstein Y, Bear MF, Shouval HZ: Learning reward timing in cortex through reward dependent expression of synaptic plasticity. Proc Natl Acad Sci U S A. 2009, 106 (16): 6826-31. 10.1073/pnas.0901835106.
- Rescorla RA, Wagner AR: A theory of Pavlovian conditioning: The effectiveness of reinforcement and non-reinforcement. Classical Conditioning II: Current Research and Theory. Edited by: AH Black & WF Prokasy. 1972, New York: Appleton-Century-Crofts, 64-69.
- Shuler MG, Bear MF: Reward timing in the primary visual cortex. Science. 2006, 311 (5767): 1606-9. 10.1126/science.1123513.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.