A toolbox for the fast information analysis of multiple-site LFP, EEG and spike train recordings
© Magri et al. 2009
Received: 14 May 2009
Accepted: 16 July 2009
Published: 16 July 2009
Skip to main content
© Magri et al. 2009
Received: 14 May 2009
Accepted: 16 July 2009
Published: 16 July 2009
Information theory is an increasingly popular framework for studying how the brain encodes sensory information. Despite its widespread use for the analysis of spike trains of single neurons and of small neural populations, its application to the analysis of other types of neurophysiological signals (EEGs, LFPs, BOLD) has remained relatively limited so far. This is due to the limited-sampling bias which affects calculation of information, to the complexity of the techniques to eliminate the bias, and to the lack of publicly available fast routines for the information analysis of multi-dimensional responses.
Here we introduce a new C- and Matlab-based information theoretic toolbox, specifically developed for neuroscience data. This toolbox implements a novel computationally-optimized algorithm for estimating many of the main information theoretic quantities and bias correction techniques used in neuroscience applications. We illustrate and test the toolbox in several ways. First, we verify that these algorithms provide accurate and unbiased estimates of the information carried by analog brain signals (i.e. LFPs, EEGs, or BOLD) even when using limited amounts of experimental data. This test is important since existing algorithms were so far tested primarily on spike trains. Second, we apply the toolbox to the analysis of EEGs recorded from a subject watching natural movies, and we characterize the electrodes locations, frequencies and signal features carrying the most visual information. Third, we explain how the toolbox can be used to break down the information carried by different features of the neural signal into distinct components reflecting different ways in which correlations between parts of the neural signal contribute to coding. We illustrate this breakdown by analyzing LFPs recorded from primary visual cortex during presentation of naturalistic movies.
The new toolbox presented here implements fast and data-robust computations of the most relevant quantities used in information theoretic analysis of neural data. The toolbox can be easily used within Matlab, the environment used by most neuroscience laboratories for the acquisition, preprocessing and plotting of neural data. It can therefore significantly enlarge the domain of application of information theory to neuroscience, and lead to new discoveries about the neural code.
Information theory [1, 2], the mathematical theory of communication, has been successfully used to address a number of important questions about sensory coding [3–5]. For example, information theoretic tools have been used to characterize the stimulus selectivity of neurons, by revealing the precise sensory features (or combinations of them) which modulate most reliably the responses of neurons or of neural populations [6–8]. Information theory has been used to investigate whether the fine temporal structure of neural activity contains important information which is lost when neural signals are averaged over longer time scales. These studies have shown that the precise timing of spikes measured with respect to the stimulus onset [5, 9–12], or the relative time of spikes with respect to the ongoing network fluctuations [13, 14] provides important information that cannot be extracted from the spike count. Information theory has also been used to study population codes: several studies have developed measures to quantify the impact of cross-neuronal correlations in population activity [15–19] and have applied them to the study of population coding across various sensory modalities [20–22]. Another application of information theory to neuroscience is the measure of causal information transfer  to quantify the amount of interactions between neural populations .
Information theory has been used widely for the analysis of spike trains from single neurons or from small populations , and is now beginning to be applied systematically to other important domains of neuroscience data analysis, such as the information analysis of analog brain signals like BOLD fMRI responses [25–27] and Local Field Potentials (LFPs) [28–30]. However, the application of information theory to analog brain signals has remained relatively limited compared to the potentials it offers. One reason that has limited a wider use of the information analysis to analog signals is that estimates of information from real neurophysiological data are prone to a systematic error (bias) due to limited sampling . The bias problem can be alleviated by the use of several advanced techniques [5, 32–36] or by combinations of them . However, a detailed implementation and testing of these (often complicated) techniques is in many cases beyond the time resources available to experimental laboratories. Moreover, the performance of bias correction techniques was so far tested thoroughly primarily on spike trains, and much less on analog brain signals. Since the performance of bias correction techniques depends on the statistics of the data , it is crucial that tests and comparisons of the performance of the various techniques are also carried out on analog brain signals. Clearly, the availability of a toolbox that implements several of these accurate information calculation techniques after having validated them on analog brain signals would greatly increase the size of the neuroscience community that uses information theoretic analysis.
Another problem, which is particularly prominent when computing information quantities for multiple parallel recordings of neural activity, is the speed of computation. Multielectrode recordings now allow the simultaneous measurement of the activity of tens to hundreds of neurons, and fMRI experiments allow a broad coverage of the cerebral cortex and recording from a large number of voxels. Therefore speed of calculation is paramount, especially in cases when information theory is used to shed light on the interactions between pairs or small groups of recorded regions. In fact, the number of these subgroups (and thus the time needed to compute their information or interaction) increases fast with the number of recorded regions.
This article aims at meeting the demand for fast and publicly available routines for the information theoretic analysis of several types of brain signals, by accompanying and documenting the first release of the Information Breakdown ToolBox (ibTB). This toolbox can be downloaded from the URL http://www.ibtb.org or from the Additional Files provided with this Article (Additional file 1: ibtb.zip). The toolbox has several key features that make it useful to the neuroscience community and that will widen the domain of application of information theory to neuroscience. In particular (i) it can be used within Matlab (The Mathworks, Natick, MA), one of the most widely used environments for the collection, preprocessing and analysis of neural data, (ii) it is algorithmically optimized for speed of calculation (iii) it implements several up-to-date finite sampling correction procedures (iv) it has been thoroughly tested for the analysis of analog neural signals such as EEGs and LFPs (v) it implements the information breakdown formalisms [17, 18, 37] that are often used to understand how different groups of neurons participate in the encoding of sensory stimuli.
Before proceeding to describe the implementation of the Toolbox and to discuss its use, in the following we will briefly define the basic information quantities and describe their meaning from a neuroscientific perspective.
Consider an experiment in which the experimental subject is presented with N s stimuli s 1,... and the corresponding neural response r is recorded and quantified in a given post-stimulus time-window. The neural response can be quantified in a number of ways depending on the experimental questions to be addressed and on the experimenter's hypotheses and intuition. Here, we will assume that the neural response is quantified as an array r = [r 1,..., r L ] of dimension L. For example, the experimenter may be recording the spiking activity of L different neurons and may be interested in spike count codes. In this case r i would be the number of spikes emitted by cell i on a given trial in the response window. Alternatively, if the experimenter is recording the spiking activity of one neuron and is interested in spike timing codes, the response could be quantified by dividing the post-stimulus response window into L bins of width Δt, so that r i is the number of spikes fired in the i-th time bin . Or, the experimenter may be recording LFPs from L channels and be interested in how their power carries information. In this case r i would be quantified as the LFP power in each recording channel .
Unless otherwise stated, we will assume that the neural response in each element of the response array is discrete. If the signal is analogue in nature – such as for LFP or EEG recordings – we assume it has been discretized into a sufficient number of levels to capture the most significant stimulus-related variations. For example, the power at L distinct frequencies extracted from a single-trial LFP recording will be quantized, for each frequency, into a finite number of levels . This assumption is (unless otherwise stated) necessary for the analysis and the correct functioning of the algorithms in the toolbox. The reason why it is convenient to quantify the neural response as a discrete variable is that it makes it easier to quantify the probabilities necessary for information calculation (see below). For analog signals, the discretization can be circumvented only when there are suitable analytical models for the probability distribution (e.g. a gaussian distribution), in which case particular algorithms can be applied (see e.g. the Gaussian Method detailed below).
where P(r|s) is the probability of observing response r when stimulus s is presented; P(s) is the probability of presentation of stimulus s, and is defined as , N tr (s) being the number of trials available for stimulus s and is the total number of trials to all stimuli. Mutual information quantifies how much of the information capacity provided by stimulus-evoked differences in neural activity is robust to the presence of trial-by-trial response variability. Alternatively, it quantifies the reduction of uncertainty about the stimulus that can be gained from observation of a single trial of the neural response.
The mutual information has a number of important qualities that make it well suited to characterizing how a response is modulated by the stimulus. These advantages have been reviewed extensively [3–5, 25, 27, 38]. Here we briefly summarize a few key advantages. First, as outlined above, it quantifies the stimulus discriminability achieved from a single observation of the response, rather than from averaging responses over many observations. Second, I(S; R) takes into account the full stimulus-response probabilities, which include all possible effects of stimulus-induced responses and noise. Thus, its computation does not require the signal to be modeled as a set of response functions plus noise and can be performed even when such decomposition is difficult. Third, because information theory projects all types of neural signals onto a common scale that is meaningful in terms of stimulus knowledge, it is possible to analyze and combine the information given by different measures of neural activity (for example: spike trains and LFPs) which can have very different signal to noise ratios.
Neural signals recorded from different sites are often found to be correlated. For example, spikes emitted by nearby neurons are often synchronous: the probability of observing near-simultaneous spikes from two nearby neurons is often significantly higher than the product of the probabilities of observing the individual spikes from each neuron [39–41]. LFPs recorded from different electrodes are typically correlated over considerable spatial scales [42, 43]. Moreover, neural signals can be correlated in time as well as in space. In fact, different temporal aspects of the neural activity from the same location are often correlated. For example, spike counts and latencies covary in some systems , and the powers of different LFPs bands recorded from the same location can also exhibit significant correlations .
The ubiquitous presence of correlations of neural activity across both space and time has raised the question of what is the impact of this correlation upon the information about sensory stimuli carried by a combination of distributed sources of neural activity (see [16, 45] for recent reviews). Theoretical studies have suggested that correlations can profoundly affect the information transmitted by neural populations [16, 46]. It is therefore of great interest to quantify the impact of correlations on the information carried by a population of simultaneously recorded neurons. Mutual information, Eq. (1), can tell us about the total information that can be gained by simultaneously observing the L considered neural responses, but not about the specific contribution of correlations to this total value. However, a number of information-theoretic quantities have been developed to quantify how correlations affect the information carried by different neural signals (see below). In our information toolbox we have implemented several of such approaches, which will be briefly define and summarize in the rest of this section.
Before we describe the information theoretic tools for quantifying the impact of correlations on coding, it is useful to briefly define the types of correlations usually considered in the studies of neural population activity.
The conditional probability P ind (r|s) can be computed by taking the product of the marginal probabilities of individual elements of the response array, as in Eq. (4), or alternatively by the empirical "shuffling" procedure described as follows. One generates a new set of shuffled responses to stimulus s by randomly permuting, for each element of the response array, the order of trials collected in response to the stimulus s considered, and then joining together the shuffled responses into a shuffled response array. This shuffling operation leaves each marginal probability P(r j |s) unchanged, while destroying any within-trial noise correlations. The distribution of shuffled responses to a given stimulus s is indicated by P sh (r|s).
Many authors further distinguish noise correlations (which, as explained above, exclude sources of correlations due to shared stimulation) from "signal correlations" , which are correlations entirely attributable to common or related stimulus preferences. Signal correlations manifest themselves in similarities across stimuli between the response profiles of the individual elements of the response array. For example, neurons in different channels may all have a very similar mean response to the stimuli. There are several ways to quantify the amount of signal correlations [17, 37, 47]. We will not report them in this Article, but we will only focus on quantifying their impact on the information carried by the response array (see next section).
The importance of separating noise from signal is, as revealed by theoretical studies, that signal and noise correlations have a radically different impact on the sensory information carried by neural populations (see below). In particular, signal correlations always reduce the information, whereas noise correlations can decrease it, increase it or leave it unchanged, depending on certain conditions [46, 48].
We next describe and define briefly several mathematical techniques to quantify the impact of correlations of information. Several different approaches have been proposed (see e.g. [17, 18, 21, 49, 50]). Here we present one, called the "information breakdown" , which takes the total mutual information I(S; R) and decomposes it into a number of components, each reflecting a different way into which signal and noise correlations contribute to information transmission. We decided to focus on the information breakdown formalism partly because it was developed by one of the authors of this article, and partly because it naturally includes several of the quantities proposed by other investigators [18, 21].
The linear term I lin is the sum of the information provided by each element of the response array. This is a useful reference term because if all the elements of the array were totally independent (i.e. with both null noise and signal correlation) then the total information transmitted by the response array would be equal to I lin .
The amount of synergistic information Syn. The difference between I(S; R) and I lin is called synergy. Positive values of synergy denote the presence of synergistic interaction between elements of the response array, which make the total information greater than the sum of that provided by each element of the response array. Negative values of synergy (called "redundancy") indicate that the elements of the response array carry similar messages, and as a consequence information from the response array is less than the sum of the information provided by each individual element. The synergy can be further broken into the contributions from signal and from noise correlations, as follows.
The signal similarity component Isig-simis negative or zero and quantifies the amount of redundancy specifically due to signals correlation. We note that the negative of Isig-simequals the quantity named ΔI signal which was defined in Ref. .
The stimulus independent correlational term, Icor-ind, reflects the contribution of stimulus-independent correlations. In general, if noise and signal correlations have opposite signs, Icor-indis positive. In this case, stimulus-independent noise correlations increase stimulus discriminability compared to what it would be if noise correlations were absent [17, 48]. If, instead, noise and signal correlations have the same sign, Icor-indis negative and stimuli are less discriminable than the zero noise correlation case. In the absence of signal correlation, Icor-indis zero, whatever the strength of noise correlation.
The stimulus dependent correlational term Icor-depis a term describing the impact of stimulus modulation of noise correlation strength. Icor-depis non-negative, and is greater than zero if and only if the strength of noise correlation is modulated by the stimulus. Icor-depwas first introduced in Ref.  with the name ΔI. Icor-depis an upper bound to the information lost by a downstream system interpreting the neural responses without taking into account the presence of correlations .
Our Information Breakdown Toolbox (ibTB) has been implemented in Matlab (The Mathworks, Natick, MA) and C taking advantage of Matlab's MEX technology. It has been tested on several platforms (Mac OS X, Windows 32 and 64 bits, Ubuntu Linux) and it can be downloaded, together with a documentation for its installation and its use, from our website .
As shown in Figure 2, two routines are available for the pre-processing of the input to entropy.m. buildr.m allows to build the response-matrix, to be fed to entropy.m, starting from L + 1 -long arrays: the first array stores a list specifying, for each trial, which stimulus was presented to the subject while the L remaining arrays specify the L corresponding recorded responses. binr.m allows to discretize continuous response-matrices – prior to calls to entropy.m with the the Direct Method – according to a binning method chosen among a list of available binning options. This list includes the equi-populated binning (which returns responses whose stimulus-independent probabilities are approximately uniform) and different types of equi-spaced discretizations. Users are also given the opportunity to define their own binning strategies by easily linking their binning routine to binr.m (we refer the reader to the documentation to this function for more information on how to define custom binning methods).
Finally, information.m is a wrapper around entropy.m which directly computes the breakdown terms by combining the outputs from this main function. Its input is identical to that of entropy.m except for the list of possible outputs options which can be any or several of the following: I(S; R), I sh (S; R), syn, syn sh , I lin , I sig-sim, I cor , I cor-sh, I cor-ind, I cor-depand I cor-dep-sh.
The Direct Method  for computing information and entropies consists in estimating the probabilities of the discrete neural responses by simply computing the fraction of trials in which each response value is observed, and then by simply plugging into the information and entropy equations the response probabilities estimated in this way. In this subsection, we describe a novel, computationally optimized algorithm for computing the empirical estimates of the probabilities, which is at the core of our toolbox and is the principal reason for its speed.
The steps required for computing (R) according to (7) are the following. First the routine has to run through all of the available trials and compute the C(r) values. The program must then loop through all the N r possible r responses, normalize each C(r) by and sum the C(r) log C(r) values together:
for trial from 1 to
read r in current trial
C(r) → C(r) + 1
for r from 1 to N r
where (R) = ∑ r f(C(r)) and f (·) denotes the function f(x) = x · log x.
First of all, Equation (8) suggests that, instead of normalizing each count C(r) and then summing over r, we can instead first compute (R) and then normalize. This has the advantage of reducing the number of division operations from N r to just one.
This observation suggests that, instead of computing the final value of C(r) we can update (R) directly at each trial, inside the first loop, therefore skipping the second loop over the N r responses. The procedure for computing (R) thus becomes:
for trial from 1 to
read r in current trial
(R) → (R) + f(C(r) + 1) - f(C(r))
C(r) → C(r) + 1
where the length of the loop is determined only by the number of available trials.
The previous procedure can be extended to the direct computation of (R|S), (R) and (R|S). Since the argument of f(·) is always an integer, we can store the values computed for f(·) and use them for the computation of all four entropic quantities. In the current implementation of the toolbox, the values of f(·) persist in memory as long as does not change. Calls to the toolbox with matrices with the same number of trials perform increasingly faster and the maximum computation speed is achieved when all values of f(x) for x = 1,..., have been computed.
The number of products required to compute P ind (s) and, consequently, also the time required to compute P ind (r), (R) and χ (d)(R), increases rapidly together with the number, L, of responses.
The Direct Method relies on the empirical measure of the response probabilities as histograms of the fraction of trials in which each discrete response value was observed. Naturally, this procedure gives a perfect estimate of the information and entropies only if the empirical estimates of the probabilities equal the true probabilities. However, any real experiment only yields a finite number of trials from which these probabilities have to be estimated. The estimated probabilities are thus subject to statistical error and necessarily fluctuate around their true values. If we just plug the empirical probabilities into the information equations (a procedure often called the "plug-in" procedure in the literature), then the finite sampling fluctuations in the probabilities will lead to a systematic error (bias) in the estimates of entropies and information . In some cases, the bias of the plug-in information estimate can be as big as the information value we wish to estimate. It is therefore crucial to remove this bias effectively in order to avoid serious misinterpretations of neural coding data .
Next, we describe and compare four bias correction procedures that we implemented in our toolbox. These procedures, which are among those most widely used in the literature, were selected for inclusion in our toolbox because they are applicable to any type of discretized neural response (whatever its statistics), because they are (in our experience) among the most effective, and because they are guaranteed to converge to the true value of information (or entropy) as the number of trials increases to infinity.
where a and b are free parameters that depend on the stimulus-response probabilities, and are estimated by re-computing the information from fractions of the trials as follows. The dataset is first broken into two random partitions and the information quantities are computed for each sub-partition individually: the average of the two values obtained (for each quantity) from the two partitions provides an estimate corresponding to half of the trials. Similarly, by breaking the data into four random partitions, it is possible to obtain estimates corresponding to a fourth of the trials. Finally, a and b are extrapolated as parameters of the parabolic function passing through the /2 and /4 estimates.
This correction technique computes the linear term a in the expansion (11) through an analytical approximation rather than from the scaling of the data of the QE procedure. This approximation depends on the number of response bins with non-zero probability of being observed which is estimated through a bayesian-like procedure. The implementations of this algorithm in our toolbox follows closely the one originally described in Ref. .
where H sh (R|S) is the shuffle noise entropy, i.e., the noise entropy computed after randomly permuting, independently for each response, the order of trials collected in response to a stimulus (in other words, it is the noise entropy of the distribution p sh (r|s)). I sh (S; R) has the same value of I(S; R) for infinite number of trials but has a much smaller bias for finite , owing to the bias cancelation created by the entropy terms in the right hand side of Eq. (12).
This three shuffled-corrected quantities, syn sh , I cor-shand I cor-dep-sh, converge to the same values of their uncorrected counterparts syn, I cor and I cor-dep, respectively, for infinite number of trials. However the bias of the shuffle-corrected quantities is much smaller when the number of trials is finite. This is especially critical for the computation of I cor-depwhich is by far the most biased term of the information breakdown .
The bootstrap procedure [53, 54] consists of pairing stimuli and responses at random in order to destroy all the information that the responses carry about the stimulus. Because of finite data sampling, the information computed using the bootstrapped responses may still be positive. The distribution of bootstrapped information values (over several instances of random stimulus-response pairings) can be used to build a non-parametric test of whether the information computed using the original responses is significantly different from zero. Moreover, the average of the bootstrapped values instances can be used to estimate the residual bias of the information calculation, which can be then subtracted out. The bootstrap evaluation and subtraction of the residual error can be applied to any method to compute information (such as the Direct Method explained above and the Gaussian Method which will be explained below), with or without one of the bias correction procedures described above. In our toolbox, bootstrap estimates can be computed for the quantities H(R|S), H lin (R|S), H ind (R), H ind (R|S), χ(R) and H sh (R|S) from which bootstrapped estimates of I(S; R) and I sh (S; R) are easily obtained. The remaining quantities, H(R) and H sh (R), are not affected by the bootstrapping.
The Direct Method, being based on empirically computing the probability histograms of discrete or discretized neural responses, does not make any assumption on the shape of the probability distributions. This is a characteristic which makes the Direct Method widely applicable to many different types of data.
An alternative approach to the Direct estimation of information is to use analytical models of the probability distributions; fit these distributions to the data; and then compute the information from these probability models. This method has been applied so far relatively rarely in Neuroscience (e.g. ). In fact this approach may prove difficult to apply to distributions of spike patterns since in this case appropriate analytical forms of probability distributions are usually not available. However, several situations exist for which it is possible to fit the response distribution to Gaussian functions, especially when dealing with analog brain responses and their transformations, such as Fourier transforms (see the Results section). The Gaussian Method for computing information and entropies is the one based on fitting response probabilities to Gaussian functions.
where |σ 2(R)| and | (R)| are the determinants of the matrices of covariance computed across trials and stimuli, and across trials to stimulus s, respectively.
Note that the Gaussian Method – which we implemented using a straight computation of variances which are then fed into the above equations – does not necessarily require data discretization prior to the information calculation.
The advantage of the Gaussian Method is that it depends only on a few parameters that characterize the neural response (i.e., the variances and covariances of the responses), and is thus more data-robust, and less prone to sampling bias than the Direct calculation. The potential danger with this approach is that the estimates provided by Eq. (15) may be inaccurate if the underlying distributions are not close enough to Gaussians.
and ψ(·) is the polygamma function (implemented in Matlab as the psi function).
Our toolbox allows the computation of Gaussian estimates without bias correction, with the analytical Gaussian bias correction, Eqs. (16,17), and also with the quadratic extrapolation correction QE. However, using simulated data, we found that quadratic extrapolation did not correct well for bias for the Gaussian Method (results not shown). This can be understood by noting that Eqs. (16,17) indicate that a quadratic data scaling may not necessarily describe well the bias in the Gaussian case.
The Gaussian Method can also be used to compute the terms I lin , I sig-simand I cor of the information breakdown  by approximating, H ind (R) with H sh (R), the response entropy computed after shuffling the neural responses at fixed stimulus. Note that in this case also H ind (R| S) needs to be replaced by H sh (R| S). The calculations of all these quantities is implemented in the Toolbox. Note that the quantity χ(R) cannot be easily computed with the Gaussian Method, thereby preventing the separation of I cor into I cor-indand I cor-dep.
In the following we present several case studies and tests of the performance of our ibTB Toolbox on analog neural signals, such as EEGs and LFPs. We emphasize that the toolbox can be effectively applied to spike trains as well as to EEGs and LFPs. The reason why we focus our presentation on EEGs and LFPs is that the very same algorithms implemented in our toolbox have been already illustrated and tested heavily on spike trains; therefore the illustration and test on EEGs and LFPs is more interesting. We however report that we have thoroughly tested our toolbox on spike trains. In particular, we have used our Toolbox to successfully replicate a number of previously published spike train information theoretic studies from our group, reported in Refs. [8, 13, 14, 17, 31, 36, 59].
We start by testing the performance of bias correction procedures on simulated data. These procedures have been previously tested on simulated spike trains , but not yet on analog neural signals. We therefore tested the bias corrections on realistically simulated LFPs whose statistical properties closely matched those of real LFPs recorded from V1 of an anaesthetized macaque in responses to Hollywood color movies presented binocularly to the animal (data from ). Details of the simulations procedure are reported in Appendix A. Our goal is to estimate the information (about which of 102 different 2.048 s long movie scenes was presented) carried by the two-dimensional response array made of the power of the simulated LFP at frequencies 4 and 75 Hz. The power in each frequency was discretized into 6 equipopulated bins. Choosing the boundaries of the discretization bins so that each bin is equi-populated is a simple but effective way to obtain high information values even when discretizing an analog signal into a relatively small number of bins. This is because equipopulated binning maximizes the response entropy H(R) that can be obtained with a given number of response bins .
Figure 3A also shows that, in contrast to H(R|S), estimates of H(R) are essentially unbiased. It is a very common finding that H(R) is less biased than H(R|S)  because the former depends on P(r) which, being computed from data collected across all 102 stimuli, is better sampled than P(r|s). However, in this specific case, the fact that H(R) is unbiased stems from the chosen response discretization method: by discretizing each analog LFP power response at the two considered frequencies into 6 equi-populated bins we obtain 36 equi-probable bidimensional responses which provide H(R) = log2(36) ~ 5.17.
Figure 3C shows that the plug-in estimate of information I(S; R) is biased upward: it tends to be higher than the true value for small number of trials, and then it converges to the true value as the number of trials grows. This upward sampling bias, which originates from the downward sampling bias of H(R|S) (see Eq. (1)), can be as large as the true information value if the number of available trials is low (Figure 3C). Previous research  showed that the most crucial parameter in determining the size of the bias for the plug-in estimator is the ratio between the number of trials per stimulus, N tr (s), and the cardinality of the response space, which we denote by N r . (In this particular example, the cardinality N r of the two dimensional response space equals 36). The ratio N tr (s)/N r tells us how well-sampled are the responses.Figure 3C shows that the plug-in estimator requires N tr (s) to be at least two orders of magnitude larger than N r for it to become essentially unbiased, a characteristic that makes this estimator of limited experimental utility. This is why bias correction procedures are needed.
The performance of two such procedures implemented in our Toolbox (PT and QE) is reported in Figure 3C. Both corrections substantially improve the estimates of I(S; R), which, in this simulation became accurate for N tr (s) ≈ 27 = 128 (i.e. N tr (s)/N r ≈ 3, to be compared with N tr (s)/N r ≈ 100 for pure plug-in).
Figure 3D reports the effect of subtracting bias corrections based on data bootstrapping [53, 54]. Computing the bootstrap value of the plug-in estimate and then subtracting it from the plug-in information value is not very effective: it leads to a large overestimation of the bias and thus to a big underestimation of the information. The reason why this is the case can be understood by comparing the sampling behavior of the plug-in estimator of H(R|S) before (Figure 3A) and after (Figure 3B) the data bootstrapping. For large numbers of trials, bootstrapping makes H(R|S) equal to H(R). However, the bias of H(R|S) is much greater after bootstrapping. This is because bootstrapping enhances the support and spread of the stimulus-response distributions P(r|s), which makes these probabilities more difficult to sample . As a consequence the bootstrap estimate of the bias of I(S; R) is exaggerated and its subtraction from the plug-in estimate of I(S; R) leads to a severe underestimation of the information. However, if we apply a PT or QE bias correction first, and then estimate and subtract out the remaining residual error by bootstrap, we obtain a far more effective bias removal (Figure 3D). This shows that (consistent with the analytic arguments in ), the data bootstrapping gives a good evaluation of the residual bias error after correction, though it overestimates the actual value of the bias per se.
Finally, we considered the effect of computing information through I sh (S; R) (Eq. (12) rather than through I(S; R)). Let us consider first the sampling behavior of the plug-in estimate of the four entropies that make up I sh (S; R). Because H ind (R|S) depends only on the marginal probabilities of the response array, it typically has very small bias (Figure 3A). H sh (R|S) has the same value of H ind (R|S) for infinite number of trials, but it has a much higher bias than H ind (R|S) for finite number of trials. In fact, Figure 3A shows that the bias of H sh (R|S) is approximately of the same order of magnitude as the bias of H(R|S). Intuitively, this is expected because P sh (r|s) is sampled with the same number of trials as P(r|s) and from responses with the same dimensionality [21, 36]. In this simulation, the biases of H sh (R|S) and H(R|S) were not only similar in magnitude but actually almost identical (Figure 3A). This, as explained in , reflects the fact that for the data simulated here (which represent a low and a high LFP frequency from the same electrode) the correlations among elements of the response array were relatively weak  and thus H sh (R|S) and H(R|S) were very close both in value and sampling properties. Because the biases of H sh (R|S) and H(R|S) almost cancel each other and since H(R) is unbiased, the bias of I sh (S; R) is almost identical to that of H ind (R|S). This in turn implies that the plug-in estimator of I sh (S; R) must have a much smaller bias than I(S; R), a fact clearly demonstrated by the results in Figure 3E.
Due to its intrinsically better sampling properties, I sh (S; R) has an advantage over I(S; R) not only when using a plug-in estimation but also when using bias subtraction methods. Figure 3E shows that when using PT or QE corrections, I sh (S; R) can be computed accurately even when using only 26 = 64 trials per stimulus (corresponding to N tr (s)/N r ≈ 2). When using an additional bootstrapping procedure to subtract the residual bias (Figure 3F), the estimate of I sh (S; R) becomes almost completely unbiased, independently of the bias correction used, even with as little as 25 = 32 trials (corresponding to N tr (s)/N r ≈ 1).
It should be noted that this behavior applies to cases in which (like the one we simulated) correlations among elements of the response array are relatively weak. In conditions when the correlations among elements of the response array are very strong (as it is often case with both LFPs and spikes recorded from the nearby electrodes), then the sampling behavior of I sh (S; R) is still qualitatively similar to that reported here, with the main difference that in cases of stronger correlation I sh (S; R) tends to have a small downward (rather than upward) bias. This stems from the fact that in the presence of stronger correlations the bias of H sh (R|S) tends to be more negative than that of H(R|S) , and was verified extensively on simulated spike trains in previous reports  and by increasing the level of correlations in these simulated LFPs (results not shown).
In summary, we presented the first detailed test of bias corrections procedures (originally develop for spike trains) on simulated analog neural signals. These simulations (i) confirm that these procedures are effective also on data with statistics close to that of LFPs; (ii) show that in such case it is highly advisable to use I sh (S; R) as method to compute information; (iii) indicate that evaluating and subtracting the residual bootstrap errors of I sh (S; R) (as done in ) is particularly effective. We recommend this procedure to Toolbox users interested in computing information form multidimensional LFP or EEG responses.
In this section we illustrate the Information Breakdown formalism  to study whether signal or noise correlations between the LFP powers at different frequencies made them to convey synergistic or redundant information about visual stimuli with naturalistic characteristics. For this study, we analyzed LFPs recorded from the primary visual cortex of anesthetized macaques in response to a binocularly presented naturalistic color movie . Each recording site (51 in total) corresponded to a well-defined V1 visual receptive field within the field of movie projection. From each electrode, we measured LFPs as the 1–250 Hz band-passed neural signal. Each movie was 5 min long and was repeated 40 times in order to sample the probability distribution over the neural responses to each scene. Full details on the experimental procedures are reported in Ref. . The correlation between the LFP activity in different frequency bands on this dataset was studied in Ref.  using only linear (Pearson) correlation. Here, we extend these previous results by using the information breakdown, which takes into account both linear and non-linear correlations at all orders .
We used the information-theoretic procedure to compute how the power of LFPs at these different frequencies reflects changes in the visual features appearing in the movie. We divided each movie into non-overlapping time windows of length T = 2.048 s. Each window was considered as a different "stimulus", s, and the corresponding power spectra (obtained in each trial and in each window by means of the multitaper technique [60, 61]) were considered as the neural response. From the distribution across trials of the power at each frequency and stimulus window, we computed the mutual information between the stimulus (i.e. the considered movie fragment) and the joint power of the LFP at two selected frequencies f 1 and f 2. (Thus, the response r was a two dimensional array [r 1, r 2] containing the power at frequency f 1 and f 2, respectively.) To compute information, we used the Direct Method together with the shuffled information estimator corrected with the Quadratic Extrapolation bias correction and the bootstrap subtraction (as done in Figure 3F for simulated data).
The first maximum I(S; R) occurs when f 1 and f 2 are in the low (below 12 Hz) frequency range. A second, broader maximum is present for f 1 and f 2 in the high gamma range. The highest maximum, however, is obtained when combining a low power response with an high-gamma one. An interesting question is whether the LFP powers belonging to the highly informative low-frequency range and the high-gamma range carry independent or redundant information about the stimuli, and whether any potential redundancy is due to shared sources of variability (i.e. noise correlation) or to similarities in the tuning to different scenes of the movie (i.e. signal correlation). In the following, we will use the information breakdown to address this question.
Let's first consider any two frequencies f 1 and f 2 belonging to the low frequency range. A comparison of I(S; R) (Figure 4A) and I lin (Figure 4B) shows that I(S; R) is only slightly less than I lin . Thus, as made explicit in Figure 4C, there is only a very small negative synergy (i.e. a positive redundancy) between low LFP frequencies. To understand the origin of this small redundancy, we used the information breakdown . This formalism shows that low LFP frequencies have little redundancy not because they are independent, but because they share noise correlations whose effect cancel out. In particular, there is a negative stimulus-independent correlation component I cor-ind(Figure 4E) which is almost exactly compensated by a stimulus-dependent correlation component I cor-dep(Figure 4F). Unlike noise correlations, signal correlations have a very little specific impact on the information carried by the two frequencies (because I sig-simis zero; Figure 4D). These results suggest that the low-frequency LFP bands share a strong common source of variability and thus do not originate from entirely distinct processing pathways, even though they add up independent information about the external correlates.
We then considered the case in which f 1 belongs to the the low frequency range while f 2 is in the high-gamma range. In this case, the powers at f 1 and f 2 added up independent information about the external correlates, because the joint information I(S; R) (Figure 4A) was equal to I lin (Figure 4B), and as a consequence there is zero synergy between (Figure 4C) between the low and high-gamma LFP frequencies. The information breakdown  shows that signal correlations have no impact on the information carried by the two frequencies (because I sig-simis zero; Figure 4D), and that the same applies to both stimulus-dependent and stimulus-independent noise correlations (Figure 4F and 4E). Thus, low LFP frequencies and high-gamma LFPs generated under naturalistic visual stimulation share neither noise nor signal correlations. They appear to be uncorrelated under naturalistic visual stimulation, and are thus likely to arise from fully decoupled neural phenomena.
We finally examined the case in which both f 1 and f 2 belong to the high-gamma frequency range. In this case, there is considerable negative synergy (i.e. positive redundancy) between such frequencies. This redundancy can be attributed to signal correlation (because I sig-simis strongly negative), which means that high gamma frequencies have a similar response profile to the movie scenes. The redundancy between high gamma frequencies is further enhanced by a negative effect of stimulus-independent noise correlation I cor-ind, Figure 4E. This can be explained by the results of a previously reported linear correlation analysis  which suggested a presence of a small amount of positive noise correlation between high gamma frequencies that accompanies the positive signal correlation, and with the fact that a combination of signal and noise correlation with the same sign leads to a negative I cor-ind.
It is interesting to note that the results obtained with the information breakdown are compatible with those obtained on the same dataset using a simpler linear signal and noise correlation . Since the information breakdown, unlike the linear correlation theory, is able to capture the impact of non-linear signal and noise correlations if they were present, the equivalence between linear correlation theory and information breakdown can be taken as strong evidence that the linear correlations individuated in  are a sufficient description of the functional relationship between LFP responses at different frequencies.
In this section, we illustrate the accuracy and performance on real LFPs responses of the Gaussian Method. We consider again the calculation of how the LFP power encodes information about naturalistic movies, and we use again the same set of LFPs recorded from the primary visual cortex of an anesthetized macaque in response to a binocularly presented naturalistic color movie , which were analyzed in the previous section. We computed the information I(S; R) – about which of the 2.048 s long movie scene in which we divided the movie was being presented – carried by the LFP power at a given frequency f. The response r was a scalar, r f , containing the power at frequency f. We estimated the information I(S; R) either with the Gaussian Method or with the Direct Method.
When using the Gaussian Method, we first estimated the power in each stimulus window and trial using the multitaper technique. We then took the cubic root of this power; we fitted the distribution of this response to each stimulus to a Gaussian; and we finally computed the information through Eqs. (13) and (14) subtracting the analytic gaussian bias correction, Eqs. (16) and (17). The reason for applying the cubic root transformation is that multitaper power estimates are asymptotically chi-square distributed  thus their cubic root is approximately Gaussian . The cubic root operation, being monotonic, does not affect the underlying information values of the power, but it makes response probabilities much more Gaussian and thus facilitates information estimation with the Gaussian Method. When using the Direct Method, we simply discretized these transformed power values into M equi-populated bins and computed information through Eq. (1) and corrected it for bias using QE. The number of response bins M was varied in the range 4–8 (see below).
A comparison between Gaussian and direct estimates may be useful to evaluate the effect of refining the discretization of neural responses. For this dataset, we found that when using more bins to discretize responses for the Direct Method (e.g. M = 10 or M = 12 bins), the direct information values I (d) do not change appreciably (data not shown). However, when decreasing the number of bins to M = 4 and M = 6 the resulting scatterplot of I (g) versus I (d) was again distributed along a line (like in Figure 5A) but with slopes of I (g) = 1.1 · I (d) and I (g) = 1.2 · I (d), respectively. These findings suggest several conclusions. First, differences that could be observed between gaussian and direct estimates were due to loss of information caused by poor discretization when using low number of bins for the Direct estimate. Second, using 8 response bins is sufficient to capture of the information of the LFP power and using less bins leads to very moderate information losses (of the order of 10–20%). Third, this suggests that knowledge of the second order statistics of the root-transformed power-values is sufficient for extracting the bulk of the information from the LFP power fluctuations.
To demonstrate the sampling properties and data robustness of the Gaussian information estimates, we proceed as we did previously for the Direct Method, generating realistically simulated LFPs whose statistical properties closely matched those of real V1 LFPs (see Appendix A for a description of how data were generated). This time we considered the information about the 102 presented movie sequences carried by the power of either one, two or three different simulated LFP frequencies (i.e., the response was, respectively, one, two or three dimensional). Results are reported in Figure 5B. We found that if no bias correction was used, the Gaussian information values were all upward biased, and the bias grew with the dimensionality of the response space (Figure 5B). However, using the analytic bias correction in Eqs. (16) and (17) eliminated the bias completely, with essentially identical accuracy for all considered dimensions.
Taken together, these results indicate that the Gaussian Method can be an extremely accurate and useful tool for studying the information content of analog neural signals. Because of its great data robustness, we strongly recommend its use on any neural signal whose response probabilities are consistent with Gaussian distributions.
We next demonstrate the applicability of our toolbox to the analysis of single-trial EEGs. We considered EEGs recorded from a male volunteer with a 64-channel electrode cap while the subject was fixating the screen during the repeated presentation of a 5 s-long naturalistic color movie presented binocularly. Full details on experimental procedures are reported in Appendix B. We then used our Toolbox to investigate which frequency bands, which signal features (phase or amplitude), and which electrode locations better encoded the visual features present in movies with naturalistic dynamics.
To understand which frequency bands were more effective in encoding the movie, we used a causal bandpass filter (see Appendix B for details) to separate out the range of EEG fluctuations at each electrode into distinct frequency bands (delta: 0.1–4 Hz; theta: 4–8 Hz; alpha: 8–12 Hz; beta: 12–20 Hz). We then extracted, by means of Hilbert transforms of the bandpassed signal, the instantaneous phase and power of the EEG fluctuations in each electrode, frequency band, and trial and examined the time course of amplitude and phase during the movie.
To compare the reliability of phase and power of the delta-range fluctuations at different points of the movie, we discretized the power of the delta band EEG from electrode PO8 at each time point into four equipopulated bins. We found that power was much less reliable across trials than phase (Figure 6D). As a consequence, we also found that power carried only 0.05 bits of information about the movie, and was thus much less informative than the delta phase from the same electrode.
Having illustrated the encoding of the movie by EEGs with an example recording channel and a selected EEG frequency range, we next characterized the behavior across all electrodes and over a wider range of EEG frequencies. Results are plotted in Figure 6E (phase information) and 6F (power information). We found that only low frequency ranges (delta) were informative, and that phase was far more informative than power at all electrodes. This is consistent with the attenuation properties of the skull, which is more likely to attenuate power but with relatively little introduction of phase shifts. The most informative regions were found in the right occipital parietal lobe covering the right visual cortices. It has been hypothesized that such informative low frequency fluctuations observed in visual cortex may reflect the entrainment to slowly changing informative features in the sensory signal [63, 64]. To gain some insights on why EEGs recorded from the right hemisphere were far more informative about the movie than those recorded from the left hemisphere, we calculated the mean pixel luminance of the movie clip over the 5 s stimulation window separately for left and right hemifields. We found that the mean luminance was greatest in the left visual field as compared to the right. This provides one potential reason to explain the lateralization of information about the movie observed in this subject.
This example demonstrates the capabilities of the information analysis to extract the most informative components of EEG signals even when using complex dynamic stimulation paradigms and illustrates the potentials of this toolbox for single-trial EEG analysis.
In order to allow users to familiarize with the Toolbox, we have included (as Additional File to this Article) the entire dataset of EEG Delta Phases for all 64 channels and all trials, together with a commented script that loads the data and computes information through the appropriate calls to the Toolbox (Additional file 2: eegtest.zip). Running the script contained in the afore mentioned file outputs the results plotted in top left plot of Fig 6E.
Other groups have developed, or are currently developing, toolboxes for the information analysis of neural responses. Here we briefly discuss some of the relative features of current releases of other information theoretic toolboxes, and their complementariness.
Ince and colleagues  recently released an information theoretic toolbox for neuroscience data called pyentropy based on the Python programming language . This toolbox has the so far unique feature of including an advanced and memory-efficient algorithm for the computation of entropies which are maximal under given constraints, thereby allowing an easy calculation of many of the "maximum entropy" quantities which have received substantial attention in recent years for the study of neuronal interactions [67–69]. The choice of Python as a programming language comes with several advantages provided by its open source nature, its flexibility, and its very efficient use of memory. The use of Python could however be problem for most experimental neuroscience laboratories, which currently make use of Matlab for preprocessing, analysing and plotting the data.
Another available information theoretic toolbox for spike train analysis is the Spike Train Analysis Toolkit (STAToolkit) [70, 71]. STAToolkit is, like our toolbox, based on C-MEX technology and like ours can be easily used in Matlab by experimental neuroscience laboratories. A unique and important feature of STAToolkit is the large number of estimation methods for the information carried by spike trains, including techniques such as the binless estimation  and the metric space approach .
With respect to the two above toolboxes, our new toolbox presents two distinctive features. First, it is the only package which has been tested heavily non only on spike trains but also on analog brain recordings such as LFPs, and EEGs. It also includes algorithms which are specific for these signals, such as the Gaussian Method information calculation and its bias correction. The second distinctive features of our toolbox is the speed of computation. This speed advantage is not only due to the C implementation, but also to the new algorithm for fast entropy calculation that we presented here. By comparing systematically the speed of our toolbox on simulated data with the speed of pyentropy , we found that our toolbox has a speed advantage of typically an order of magnitude to the one of pyentropy. We found similar speed advantage in comparison to STAToolkit.
This paper accompanies the first release of ibTB, which we will continue to be developed over the coming years. Some features that we are working to implement in future releases include:
Additional bias corrections procedures. Currently, we implemented some of the best known and most useful bias correction procedures for the computation of mutual information. Other important corrections exist (e.g. [34, 35]), however, which we plan to implement and include in the toolbox in the near future. Additionally, starting with the next release of ibTB, users will be given the opportunity to plug-in their own custom bias correction routines linking them very easily to the main routines in the toolbox.
Additional methods. The Gaussian Method is one of the many analytical procedures existing for the computation of entropy and mutual information: actually, several other methods are available which take into account other probability distributions . The modular structure of the toolbox allows to very easily add new methods to the toolbox: these will be gradually introduced with future releases.
fMRI analysis. We are currently in the process of testing and adapting our toolbox to its use with BOLD fMRI data. Although we developed  the bias procedure used successfully in recent information analysis of fMRI data some papers [26, 75], more work is needed to understand the specific problems caused by the statistics of fMRI data, and how best to use information theory to detect voxels significantly tuned to the stimuli . We plan to report thorough studies of this issues on the toolbox website as soon as possible.
Neuroscientists can now record, simultaneously and from the same brain region, several types of complementary neural signals, such as spike, LFP, EEG or BOLD responses, each reflecting different and complementary aspects of neural activity at different spatial and temporal scales of organization. A current important challenge of computational neuroscience is to provide techniques to analyze and interpret these data [27, 76–81]. We believe that the new fast information theoretic Matlab Toolbox presented here offers a useful technology tool to analyze these complementary brain signals and understand how the brain may combine together the information carried by aspects of neural activity at these different levels of its organization.
Project name: Information Breakdown ToolBox
Project home page: http://www.ibtb.org
Operating system: tested on Mac OS X, Windows 32 and 64 bits, Linux
Programming language: Matlab (toolbox tested on R2008a and successive releases) and C
Other requirements: Microsoft Visual C++ 2008 Redistributable Package x86 (or x64) for use on Windows 32 bit (or 64 bit) machine. The package is freely downloadable from Microsoft's website and is only required if Visual C++ is not installed.
Licence: ibTB is distributed free under the condition that (1) it shall not be incorporated in software that is subsequently sold; (2) the authorship of the software shall be acknowledged and the present article shall be properly cited in any publication that uses results generated by the software; (3) this notice shall remain in place in each source file.
Any restriction to use by non-academics: none.
We simulated the LFP power of a recording site in primary visual cortex (V1) in response to many different movie scenes. In brief, data were simulated as follows. We selected from the dataset of  a given example recording channel (channel 7 from animal D04), and we computed multitaper estimates of the power at three chosen frequencies (4, 25 and 75 Hz) in response to approximately 2-s-long scenes of Hollywood color movies presented binocularly to the animal. The multitaper technique allows to reduce the variance of the spectral estimates while keeping the bias under control: this is achieved by means of taking the average of different direct spectra computed using tapers which are orthogonal to each other (see  for more details). The maximum number of averaged spectra is a free parameter (named K) which is set by the user. Here we chose K = 3, thereby providing power estimates which are distributed approximately as a chi-square with 6 degrees of freedom. We then applied Wilson and Hilferty's cube-root transformation : this transformation, being monotonic does not affect the information content of the responses while making the response-distributions to a fixed movie scene approximately gaussian (a fact that we also verified empirically). We use the same approach for simulation of multi-dimensional responses, by assuming that the joint distribution of the root-transformed power at two or three different frequencies during each fixed movie scene was a multivariate Gaussian. We generated many instances of this Gaussian power-responses by means of Matlab's mvnrnd function using mean and standard deviation values which were computed, for each scene, from the real data. For entropy estimates computed using the Direct Method, the data simulated in this way have been further discretized into 6 equi-populated response bins.
The EEG was acquired using a 64 channel electrode cap (BrainAmp MR, BrainProducts). Electrode placement followed the International 10–20 System and electrodes were all referenced to a frontal central electrode (FCz). Electrode impedances were kept below 15 KOhms. Horizontal and vertical eye movements were recorded using an electro-oculogram (EOG) with electrodes placed over the outer canthus of the left eye as well as below the right eye. Subjects were comfortably seated in a dimly lit room. EEG recordings were digitally recorded at 1000 Hz with a bandpass of 0.1–250 Hz and stored for offline analysis. A small fixation cross on black background was shown in order to indicate the beginning of the trial. After 2 seconds of fixation, a 5 second movie segment (full field) was presented, followed by 2 seconds of continued fixation, resulting in trials totaling 9 seconds of fixation. A movie clip, consisting of fast moving and colorful scenes from a commercially available movie, was presented 50 times. All data analysis procedures were implemented with the Matlab programming language in combination with the EEGlab analysis toolbox  as described below. Postprocessing was performed using the EEGlab analysis software (Neuroscan). EEG epochs (-1000 to 5000 ms temporal range) were created based on the onset of triggers recorded during the recording session. An EOG artifact correction algorithm was used to remove all trials with amplitudes that exceed ± 75 mV. After artifact rejection, 30 movie presentation trials remained.
To obtain bandpassed EEGs from each electrode, we bandpassed the raw EEG signal sampled at 1 KHz with a zero-phase-shift Kaiser filter with sharp transition bandwidth (1 Hz), very small passband ripple (0.01 dB), high stopband attenuation (60 dB), and bandwidth corresponding to the considered band (e.g. 0.1–4 Hz for delta; 4 – 8 Hz for theta, etc. – see main text). These filters were exactly equal to those used for LFPs in Refs [13, 14]; we refer to these references for more details.
functional magnetic resonance imaging
Local Field Potential
Information Breakdown ToolBox
We thank C. Kayser, A. Belistki, R. Ince, N. Ludtke, A. Mazzoni, F. Montani, M. Montemurro and G. Notaro for many useful discussions and for testing the code, and E. Molinari for useful discussions. This research was supported by the BMI project of the Department of Robotics, Brain and Cognitive Sciences at the Italian Institute of Technology, and by the Max Planck Society.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.