Skip to main content
  • Poster presentation
  • Open access
  • Published:

Direct reinforcement learning, spike time dependent plasticity and the BCM rule

Learning agents, whether natural or artificial, must update their internal parameters in order to improve their behavior over time. In reinforcement learning, this plasticity is influenced by an environmental signal, termed a reward, which directs the changes in appropriate directions. We model a network of spiking neurons as a Partially Observed Markov Decision Process (POMDP) and apply a recently introduced policy learning algorithm from Machine Learning to the network [1]. Based on computing a stochastic gradient approximation of the average reward, we derive a plasticity rule falling in the class of Spike Time Dependent Plasticity (STDP) rules, which ensures convergence to a local maximum of the average reward. The approach is applicable to a broad class of neuronal models, including the Hodgkin-Huxley model. The obtained update rule is based on the correlation between the reward signal and local data available at the synaptic site. This data depends on local activity (e.g., pre and post synaptic spikes) and requires mechanisms that are available at the cellular level. Simulations on several toy problems demonstrate the utility of the approach. Like most stochastic gradient based methods, the convergence rate is slow, even though the percentage of convergence to global maxima is high. Additionally, through statistical analysis we show that the synaptic plasticity rule established is closely related to the widely used BCM rule [2], for which good biological evidence exists. The relation to the BCM rule captures the nature of the relation between pre and post synaptic spiking rates, and in particular the self-regularizing nature of the BCM rule. Compared to previous work in this field, our model is more realistic than the one used in [3], and the derivation of the update rule applies to a broad class of voltage based neuronal models, eliminating some of the additional statistical assumptions required in [4]. Finally, the connection between Reinforcement Learning and the BCM rule is, to the best of our knowledge, new.


  1. Baxter J, Bartlett PL: Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research. 2001, 15: 319-350. 10.1016/S0954-1810(01)00028-0.

    Google Scholar 

  2. Bienenstock EL, Cooper LN, Munro PW: Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J Neurosci. 1982, 2 (1): 32-48.

    PubMed  CAS  Google Scholar 

  3. Bartlett PL, Baxter J: Hebbian synaptic modifications in spiking neurons that learn. Technical report, Reasearch School of Information Sciences and Engineering. 1999, Australian National University

    Google Scholar 

  4. Xie X, Seung HS: Learning in neural networks by reinforcement of irregular spiking. Physical Review E. 2004, 69: 041909-10.1103/PhysRevE.69.041909.

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations


Corresponding author

Correspondence to Ron Meir.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution 2.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Barash, D., Meir, R. Direct reinforcement learning, spike time dependent plasticity and the BCM rule. BMC Neurosci 8 (Suppl 2), P197 (2007).

Download citation

  • Published:

  • DOI: