### Experiment 1: Poisson model simulation

Here, we examined the performance of the functional connectivity analysis methods using a dataset generated by simulating a network of GLM-based spike response models. We designed an artificial neuronal network of 15 neurons (Fig 1a) consisting of three groups of neurons, G1, G2, and G3, each consisting of five neurons. This is a recurrent network: G1 \(\rightarrow\) G2 \(\rightarrow\) G3 \(\rightarrow\) G1. From G1 to G2, there are ten excitatory connections among the 25 pairs of pre- and post-synaptic neurons. Similarly, from G2 to G3, there are ten excitatory connections among 25 pairs, and from G3 to G1, there are ten inhibitory connections among 25 pairs. There is no connection between neurons belonging to the same group. The activity of each neuron was determined by a GLM-based spike response model, Eqs. (1) and (2), with the complementary log-log link function, which simulates a Poisson spike model. Each neuron also received an external white Poisson input of around 50–150 Hz, independently of the external inputs to the other neurons. This assumption that the external input is independent between neurons differs slightly from the one made by our GLM model, in which the external input is shared by the observed neurons. However, since the situation of independent Poisson inputs is much simpler, our GLM-based spike response model can deal with it simply by omitting the estimation procedure for the external input (“Estimation of external input” section). The activities of all 15 neurons were observed at a sampling frequency of 200 Hz. With these settings, we obtained on average 0.6 to 2 spikes per neuron per second (200 observation time-samples).
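To make the generative model concrete, the following minimal sketch simulates such a recurrent network with a complementary log-log link. It is an illustrative assumption-laden simplification, not the paper's exact settings: a single-lag weight matrix `W` stands in for the full spike response functions, and the weight magnitudes and baseline rate are chosen only to reproduce the qualitative regime (roughly 1 Hz output rates at 200 Hz sampling).

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 15, 2000                      # 15 neurons, 2000 samples at 200 Hz

# Block-structured weights: G1 -> G2 (+), G2 -> G3 (+), G3 -> G1 (-).
# A single-lag weight stands in for the full spike response function.
W = np.zeros((N, N))
for pre, post, sign in [(0, 5, +1.0), (5, 10, +1.0), (10, 0, -1.0)]:
    block = rng.random((5, 5)) < 0.4          # ~10 of 25 pairs connected
    W[post:post + 5, pre:pre + 5] = sign * 1.5 * block

base = np.log(-np.log(1 - 1.0 / 200))         # baseline giving ~1 Hz output
spikes = np.zeros((N, T), dtype=int)
for t in range(1, T):
    eta = base + W @ spikes[:, t - 1]         # linear predictor
    p = 1.0 - np.exp(-np.exp(eta))            # complementary log-log link
    spikes[:, t] = rng.random(N) < p

print(spikes.sum(axis=1) / (T / 200))         # spikes per second per neuron
```

For small rates the complementary log-log inverse \(p = 1 - e^{-e^{\eta}}\) approximates a discretized Poisson intensity \(e^{\eta}\), which is why this link simulates Poisson spiking.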

First, we compared the results of the sparse and non-sparse estimation methods for GLM. We selected spike train data of 2000 and 10,000 samples from the simulated data for training and 10,000 samples for validation. Both training lengths are considerably shorter than typical; as a reference, Kim et al. [7] used 50,000 samples to demonstrate their method. With 50,000 samples, our method achieved 100 % accuracy, although the results are not shown here.

We compared two types of regularization for GLM, group lasso (gl) and L2 (l2), corresponding to sparse and non-sparse estimation, respectively. We also tested L1 regularization, but omit its results because its performance was comparable to that of group lasso. We also compared two link functions, logistic (lr) and Poisson (pr); here *Poisson* stands for the complementary log-log link function that simulates Poisson spikes. The SRFs were estimated using the training data with various hyperparameters that determine the strength of regularization. The estimated SRFs were evaluated with two criteria, detection accuracy of the true functional connectivity and fitness to unseen data; detection accuracy is shown as a receiver operating characteristic (ROC) curve and the area under the curve (AUC) score, whereas fitness is given by the log-likelihood for the 10,000 validation samples. In sparse estimation, such as with group lasso regularization, some statistics like *Peak* may be exactly zero for some connections. Although a naive threshold to detect functional connectivity based on such statistics would be zero, setting a non-zero threshold can in fact produce better functional connectivity estimates. Therefore, we evaluated the ROC and AUC by varying the threshold value even for the sparse estimation methods.
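The ROC/AUC evaluation by threshold sweep can be sketched as follows. This is a generic implementation, not the paper's code; note that exact zeros produced by sparse estimators simply rank at the bottom of the sweep, which is why a non-zero threshold can still be evaluated for them.

```python
import numpy as np

def roc_auc(stat, truth):
    """ROC points and AUC by sweeping a threshold over statistic values.

    stat  : array of connection statistics (e.g. |Peak|); zeros allowed,
            as produced by sparse estimators such as group lasso
    truth : boolean array, True where a direct connection really exists
    """
    order = np.argsort(-stat)                 # descending statistic
    tps = np.cumsum(truth[order])             # true positives at each cut
    fps = np.cumsum(~truth[order])            # false positives at each cut
    tpr = tps / truth.sum()
    fpr = fps / (~truth).sum()
    # trapezoidal area under the ROC curve
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1])) / 2)
    return fpr, tpr, auc

# toy example: 6 candidate links; here the statistic separates perfectly
stat = np.array([2.0, 0.0, 1.5, 0.1, 0.0, 0.9])
truth = np.array([True, False, True, False, False, True])
fpr, tpr, auc = roc_auc(stat, truth)
print(auc)    # -> 1.0 (all true links ranked above all false ones)
```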

Figure 1 shows the results. In all aspects, the logistic and Poisson link functions did not show any significant differences, whereas the difference between sparse and non-sparse estimation was fairly large. When the hyperparameter was chosen to maximize the fitness to the validation data, the connectivity detection performance was poor, as shown in the ROC curves (b) and (e). As shown in panels (c) and (f), the best hyperparameters in terms of detection accuracy or AUC were smaller than those for the largest fitness (marked). The AUC could be made larger by arbitrarily setting the hyperparameter to a smaller value. With respect to fitness, the best model was achieved by group lasso for both training data sizes, 2000 and 10,000, as shown in (d) and (g), respectively. With respect to detection accuracy, the best AUC with L2 regularization (AUC=0.88) was better than that with group lasso (AUC=0.85) for \(T=2000\), and the best AUCs were 1.0 with both L2 and group lasso for \(T=10{,}000\), as shown in (c) and (f), respectively. More importantly, L2 regularization performed well over a wide range of hyperparameters, while group lasso with non-optimal hyperparameters performed noticeably worse than its best performance.

We then compared the results obtained by the three statistical tests, **CHI2**, **DOF**, and **EB**, described in the “Statistical tests for functional connectivity analysis” section. The GLM-based spike response model was estimated to maximize the L2-regularized log-likelihood (6). When performing the **CHI2** and **DOF** tests, the regularization hyperparameter was set to \(\eta =0.001\). When performing the **EB**, we compared two cases, **EB-LR** and **EB-regularized-LR**, in which the regularization hyperparameter was set to \(\eta =0.001\) and \(\eta =3.0\), respectively. The value \(\eta =0.001\) was arbitrarily determined to obtain results similar to those of \(\eta =0\) but with computational stability. The value \(\eta =3.0\) was also set arbitrarily, because there was no principled way to choose it to maximize the detection power. We also used smooth bases in the **EB-regularized-LR**.

Figures 2 and 3 show the results. The upper panels show the ROC curves, with the AUC scores in the titles; higher curves and larger scores are better, respectively. In the lower panels, the false discovery proportion (FDP) is shown for each q value threshold on the horizontal axis; the FDP is the proportion of false positives (links detected as functionally connected where there was in fact no direct connection) among all links detected as functionally connected, i.e., those whose q values were smaller than the threshold on the horizontal axis. The lower panels thus evaluate the quality of the q value estimation. Because a q value is defined as an estimate of the FDP, when the FDP is smaller than the q value threshold, the corresponding q value estimation is said to be conservative. From the upper panels of Fig. 2, the ROC curves and AUC scores of the proposed method (**EB-regularized-LR**, red lines) consistently outperformed those of its un-regularized version (**EB-LR**, blue lines), and this benefit of regularization was especially prominent when the observation length *T* was small. In the lower panels of Fig. 2, the conservativeness of the q value estimation was compared among **EB-LR** (blue solid lines), **CHI2** (blue dotted lines), **DOF** (blue broken lines), and the proposed **EB-regularized-LR** (red solid lines). If a line is close to the diagonal \(x=y\) line, the corresponding q value estimation is faithful, and if a line is below the diagonal line, the q value estimation is conservative. We prefer q value estimation that is faithful but conservative; that is, we do not want a line that goes above the diagonal. We found that the empirical Bayesian q values (blue solid lines) were stably good, while the classic chi-square test without DOF adjustment (**CHI2**; blue dotted lines) almost failed to control false positives when the observation length *T* was small.
When the observation length *T* was large enough, the results depicted by the blue lines were consistent with those reported in [7]. In the lower panels, the FDP values sometimes went above the diagonal line, violating conservativeness. This violation comes from the estimation variance of the q values; a large variance in q value estimation may arise especially when there are significant correlations between hypotheses, which is the case in functional connectivity analysis of neuronal networks.
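The conservativeness check in the lower panels can be expressed as a simple computation: for each q value threshold, compare the realized FDP against the threshold itself. This is a generic sketch of that evaluation, not the paper's code.

```python
import numpy as np

def fdp_curve(qvals, truth, thresholds):
    """Realized false discovery proportion at each q-value threshold.

    A q-value procedure is conservative when the realized FDP stays at
    or below the threshold (the curve stays under the diagonal).
    """
    fdp = []
    for th in thresholds:
        detected = qvals < th
        if detected.sum() == 0:
            fdp.append(0.0)                   # nothing detected: FDP is 0
        else:
            fdp.append((detected & ~truth).sum() / detected.sum())
    return np.array(fdp)

# toy example: 6 links with estimated q values and ground truth
qvals = np.array([0.01, 0.04, 0.12, 0.30, 0.02, 0.50])
truth = np.array([True, True, False, False, True, False])
th = np.array([0.05, 0.1, 0.2, 0.4, 0.6])
print(fdp_curve(qvals, truth, th))
```

In this toy case the FDP at thresholds 0.2 and 0.4 (0.25 and 0.4, respectively) exceeds the threshold, i.e. the procedure would sit above the diagonal there.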

Figure 3 shows the results of statistical tests employing shape-related statistics (“Shape-related statistics” section). We did not find any substantial differences, in either total detection accuracy (upper panels, evaluated in terms of ROC) or false positive control (lower panels, evaluated in terms of faithfulness and conservativeness of the q value estimation), among the *Peak* (red lines), *MaxZ* (blue lines), *Surface* (green lines), and *MD* (magenta lines) statistics. It is noteworthy that we numerically confirmed stable false discovery rate control regardless of the possibly different null distributions of the various test statistics. This result may allow us to select arbitrary test statistics within the proposed empirical Bayesian testing framework. A good ROC may be expected by selecting a test statistic that fits the particular shapes of the true spike response functions; in the current case, however, the ROCs were similar between the different statistics, partly because we assumed a simple spike response function shape that caused no disagreement between the test statistics compared here. The ROC curve of the *MaxZ* statistic includes the 95 % CI criterion as a special case (see “Statistical tests for functional connectivity analysis” section), marked by ’o’ on the curve (upper panels in Fig 3). We found that the 95 % CI provided a moderate balance between sensitivity and specificity in these cases. Note, however, that the 95 % CI has no guarantee of the conservativeness that EB provides via q values.
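The shape-related statistics can be computed directly from an estimated SRF and its pointwise standard errors. The definitions below are our plausible reading of the statistics named in the text (the *MD* statistic is omitted), not verbatim from the paper; the EPSP-like test function is likewise an assumption for illustration.

```python
import numpy as np

def shape_statistics(srf, se):
    """Shape-related statistics of an estimated spike response function.

    srf : estimated response R(s) over time lags s
    se  : pointwise standard errors of the estimate
    Peak    -- value of R(s) with the largest magnitude (signed)
    MaxZ    -- largest |R(s)| / SE(s); thresholding MaxZ at ~1.96
               reproduces a pointwise 95 % CI criterion
    Surface -- area under |R(s)|
    """
    peak = srf[np.argmax(np.abs(srf))]
    maxz = float(np.max(np.abs(srf) / se))
    surface = float(np.sum(np.abs(srf)))
    return peak, maxz, surface

lags = np.arange(40)
srf = 1.2 * np.exp(-((lags - 8) ** 2) / 18.0)   # EPSP-like bump at lag 8
se = np.full(40, 0.25)
print(shape_statistics(srf, se))
```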

### Experiment 2: a network of non-linear conductance-based neurons

Next, we applied the functional connectivity analysis to a model mismatch case. A spike train dataset was obtained by simulating a recurrent network of non-linear conductance-based neurons. Each neuron was modeled by single-compartment Hodgkin–Huxley (HH) equations [25], consisting of sodium, potassium, and leak channels. Thus, the neuronal activities did not follow GLMs. We expect, however, that the proposed false positive control procedure works reasonably well, because our framework mainly depends on the empirical null distribution of test statistics, built from a set of surrogate neuron pairs, and hence does not depend on the GLM’s goodness of fit.

The simulated recurrent network resembled that in Experiment 1: three layers, each of which had five HH-type neurons. The simulation was conducted using the NEST simulator [26, 27], and the parameters of the HH neurons were set to the NEST simulator’s defaults.
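For reference, a single HH neuron with sodium, potassium, and leak channels can be integrated directly. The sketch below uses the standard squid-axon parameters and forward Euler; NEST's defaults and integrator may differ slightly, and the injected current value is an illustrative choice.

```python
import numpy as np

def hh_step(V, m, h, n, I, dt):
    """One forward-Euler step of the single-compartment Hodgkin-Huxley
    model (standard squid-axon parameters; C = 1 uF/cm^2)."""
    gNa, gK, gL = 120.0, 36.0, 0.3         # max conductances, mS/cm^2
    ENa, EK, EL = 50.0, -77.0, -54.387     # reversal potentials, mV
    # channel-gating rate functions (1/ms)
    am = 0.1 * (V + 40.0) / (1.0 - np.exp(-(V + 40.0) / 10.0))
    bm = 4.0 * np.exp(-(V + 65.0) / 18.0)
    ah = 0.07 * np.exp(-(V + 65.0) / 20.0)
    bh = 1.0 / (1.0 + np.exp(-(V + 35.0) / 10.0))
    an = 0.01 * (V + 55.0) / (1.0 - np.exp(-(V + 55.0) / 10.0))
    bn = 0.125 * np.exp(-(V + 65.0) / 80.0)
    INa = gNa * m**3 * h * (V - ENa)       # sodium current
    IK = gK * n**4 * (V - EK)              # potassium current
    IL = gL * (V - EL)                     # leak current
    V += dt * (I - INa - IK - IL)
    m += dt * (am * (1 - m) - bm * m)
    h += dt * (ah * (1 - h) - bh * h)
    n += dt * (an * (1 - n) - bn * n)
    return V, m, h, n

dt, V = 0.01, -65.0                        # dt in ms
m, h, n = 0.053, 0.596, 0.317              # resting-state gating values
trace = []
for t in range(int(50 / dt)):              # simulate 50 ms
    I = 10.0 if t * dt > 5.0 else 0.0      # current step (uA/cm^2) at 5 ms
    V, m, h, n = hh_step(V, m, h, n, I, dt)
    trace.append(V)
print(max(trace))                          # spike peaks exceed 0 mV
```

With this supra-threshold drive the neuron fires repetitively, which is the kind of non-linear dynamics a GLM cannot represent exactly.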

The results are shown in Fig. 4, in which the statistical tests behave similarly to those in Experiment 1. The q value estimation by the chi-square tests, **CHI2-LR** (blue broken lines) and **DOF-LR** (blue dotted lines), remained substantially poor even when the observation length *T* was large. When employing non-linear neurons such as HH-type ones, the GLM could no longer represent their non-linear behaviors well, and the model under the alternative hypotheses contained bias that was not resolved even after very long observations. Although the GLM-based spike response model was biased, the empirical Bayesian test with regularized model estimation, **EB-regularized-LR** (red lines), showed reasonably good ROC curves and AUC values (upper panels) and faithful q value estimation (lower panels), based on the empirical null distribution constructed by the surrogate method.

### Experiment 3: partial observation case

We applied the proposed method to a partial observation case involving external input. We generated an artificial spike train of 400,000 time samples by simulating the same artificial neuronal network as in Experiments 1 and 2. In the simulation, we added an artificial external input signal (Fig 5a) to all neurons, where the connection weights from the external input to the neurons were set to random positive values.

To model partial observation, we regarded 7 of the 15 neurons as non-observed, and the spike response functions between the 8 observed neurons, namely those of the 8 times 8 pairs of pre- and post-synaptic neurons, were estimated by our procedure.

The external input was estimated by applying a moving-average filter of window size 20 followed by PCA. The estimated external input is shown as a red line in Fig 5a. We compared the AIC values of seven models with 0, 1, ..., 6 external inputs and found the model with one external input to be the best (Fig. 5b). The estimated spike response functions are shown in Fig 5c, where the function shapes of the 8 times 8 directed neuron pairs are divided into the following four cases: (0) not connected, (1) directly connected, (2) connected via one interneuron, and (3+) connected via two or more interneurons. In addition, we also show the spike response functions of pairs between a surrogate neuron and any observed neuron, which we call category (surrogate).
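The external input estimation step can be sketched as follows: smooth each spike train with a moving-average filter, then extract the leading principal component(s) of the smoothed activity. The normalization details and the toy data are our assumptions; the paper's exact procedure is in the "Estimation of external input" section.

```python
import numpy as np

def estimate_external_input(spikes, window=20, n_components=1):
    """Estimate a shared external input: moving-average smoothing of
    each spike train, then PCA (via SVD) of the smoothed activity."""
    kernel = np.ones(window) / window
    smoothed = np.array([np.convolve(s, kernel, mode="same")
                         for s in spikes])
    centered = smoothed - smoothed.mean(axis=1, keepdims=True)
    # SVD of the neurons-by-time matrix; rows of Vt are PC time courses
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:n_components]

# toy data: 8 neurons sharing an intermittent external drive
rng = np.random.default_rng(1)
shared = (np.sin(np.linspace(0, 6 * np.pi, 4000)) > 0.9).astype(float)
spikes = (rng.random((8, 4000)) < 0.01 + 0.05 * shared).astype(float)
ext = estimate_external_input(spikes)
print(ext.shape)     # one estimated external-input time course
```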

The spike response functions were estimated as significant (\(q<0.3\)) for most of the direct connections (red lines in case (1) in Fig 5c). On the other hand, the estimated spike response functions of the indirect connections, cases (2) and (3+), were very weak and difficult to distinguish from those of cases (0) and (surrogate). The results were identical when the number of principal components was set to zero, two, or more (data not shown).

### Experiment 4: spike frequency, connection weight, and detection accuracy

How do different spike frequencies and connection weights affect the detection accuracy when information is in short supply? Does the proposed framework keep its advantage over sparse estimation methods in these cases?

We generated an artificial spike train of 2000 time samples by simulating a simple network of three neurons, 1, 2, and 3. The network had two directed facilitative connections, from 2 to 1 and from 2 to 3, whose connection weights, or the peaks of the true spike response functions, were 1.0 and *w*, respectively. Here we compared six values of *w* in \(\{0, 0.001, 0.003, 0.01, 0.03, 1\}\). All three neurons had the same background activity level *A*, namely \(R_{i0}\) in equation (2), which controls the total spike frequency. We compared six values of *A* in \(\{1.0, 1.5, 2.0, 2.5, 3.0, 3.5\}\). As the network size is small, we repeated the same simulation 50 times with different random seeds and averaged the performances in order to draw stable ROC curves.

We compared two GLM estimation methods: L2 regularization with an arbitrary setting of \(\lambda =0.1\), and variational Bayesian (VB) estimation [15]. VB approximates a hierarchical Bayesian model with an automatic relevance determination (ARD) technique that yields a sparse solution by optimizing the model fit. For test statistics, we compared *Peak*, as a representative of the simplest cases, and *MaxZ*, as a case equivalent to the application of a 95 % confidence interval [15].
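An L2-regularized GLM fit of the kind compared here can be sketched with a logistic link and plain gradient ascent on the penalized log-likelihood. This is a minimal stand-in for the paper's estimator, which also uses basis functions and the group lasso/VB alternatives not reproduced here; the design matrix, weights, and \(\lambda\) value below are illustrative assumptions.

```python
import numpy as np

def fit_l2_logistic_glm(X, y, lam=0.01, lr=0.5, iters=2000):
    """L2-regularized logistic-link GLM fit by gradient ascent on the
    penalized average log-likelihood (bias is penalized too, for
    simplicity).

    X : design matrix (e.g. lagged presynaptic spikes), y : 0/1 spikes
    """
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))        # spike probability
        grad = X.T @ (y - p) / len(y) - lam * w   # penalized gradient
        w += lr * grad
    return w

# toy data: a bias column, one real presynaptic link, one null link
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(5000), rng.integers(0, 2, (5000, 2))])
true_w = np.array([-3.0, 2.0, 0.0])               # bias, real, null
y = (rng.random(5000) < 1.0 / (1.0 + np.exp(-(X @ true_w)))).astype(float)
w = fit_l2_logistic_glm(X, y)
print(np.round(w, 2))    # the real link's weight dominates the null one
```

The L2 penalty shrinks all weights toward zero but never sets them exactly to zero, which is the qualitative difference from the sparse group-lasso and ARD estimators discussed in the text.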

Figure 6 shows the results of Experiment 4. The 36 panels correspond to the \(6 \times 6\) combinations of the simulation parameters *w* and *A*. The four ROC curves in each panel illustrate the power to detect the connection of weight *w* from neuron 2 to neuron 3 by the four methods: L2 with *Peak*, L2 with *MaxZ*, VB with *Peak*, and VB with *MaxZ*. Overall, L2 prominently outperformed VB in most cases. We also found that a larger *A* tended to improve the detection accuracy, as expected. Interestingly, changing the connection weight *w* did not much affect the detection accuracy, except for the chance-level case \(w=0\). The *Peak* statistic performed slightly better than *MaxZ*, indicating that the confidence-interval criterion is not omnipotent, although we could not conclude the general superiority of the *Peak* statistic either.

Note that the prominent performance of L2 relative to VB is consistent with the result of Experiment 1. It does not, however, imply the general inferiority of VB. A more general comparison is discussed in the “Alternative options in smoothing and regularization” section.

### Analysis of calcium imaging data

We applied the functional connectivity analysis with GLM-based spike response models to a real calcium imaging dataset. A high-throughput calcium imaging experiment was performed on a cultured slice of the CA3 region of rat hippocampus [11]. Using a Nipkow disc, the sampling frequency was as high as 100 Hz, and the number of time samples was \(T=60{,}000\); that is, the observation time was 10 min. During preprocessing, 170 regions of interest (ROIs), each of which showed its own individual activity, and the corresponding time-courses of the fluorescence level were obtained. A visual inspection found that each ROI mostly corresponded to a neuron. Then, for each fluorescence time-course, we detected the peak timings by applying a fixed template that represented the calcium profile associated with a single spike. A single detected peak may not correspond to a single real neuronal action potential; indeed, there were some cases in which only one spike was detected for burst-like calcium activity. We simply regarded such burst activity as a single spike because such cases were few. More detail on this preprocessing is given in Additional file 1. We selected 60 of the 170 ROIs that showed frequent spike activity and found that, among them, the highest spike frequency was 20 times the lowest. After this preprocessing, we evaluated the functional connectivity between the 60 neurons and its statistical significance using our method: empirical Bayesian testing after regularized parameter estimation of the GLM-based spike response model.
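The template-based peak detection step can be sketched as a matched filter: correlate the fluorescence trace with a fixed single-spike calcium template and keep supra-threshold local maxima of the matching score. The template shape (fast rise, slow exponential decay), threshold, and toy trace below are illustrative assumptions, not the paper's exact preprocessing.

```python
import numpy as np

def detect_spikes(fluo, template, threshold, half=5):
    """Detect spike timings by matching a single-spike calcium template
    against a fluorescence trace (matched-filter sketch)."""
    t = template - template.mean()
    t /= np.linalg.norm(t)                        # unit-norm filter
    score = np.correlate(fluo - fluo.mean(), t, mode="same")
    # supra-threshold local maxima of the matching score
    return [i for i in range(half, len(score) - half)
            if score[i] > threshold
            and score[i] == score[i - half:i + half + 1].max()]

# calcium transient: fast rise, slow exponential decay (100 Hz sampling)
template = np.exp(-np.arange(50) / 20.0) - np.exp(-np.arange(50) / 2.0)
rng = np.random.default_rng(3)
fluo = 0.05 * rng.standard_normal(2000)           # imaging noise
for s in (300, 900, 1500):                        # three true spikes
    fluo[s:s + 50] += template
print(detect_spikes(fluo, template, threshold=1.0))   # three detections
```

As the text notes, a burst of action potentials can produce a single calcium transient, so one detection here does not always correspond to one spike.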

For each neuron pair, the spike response function was estimated with time lags of up to 80 samples (800 ms). We prepared ten surrogate neurons and assumed two channels of external inputs, which were estimated by applying PCA (“Estimation of external input” section) to the activities of all 60 neurons. We found that false positive control based on the shape-related statistics worked as well as with the likelihood test statistic, so the q value threshold was determined using the *Peak* statistic (“Shape-related statistics” section).

Figure 7 shows the result of the functional connectivity analysis, in which 43 probable directed connections are shown by arrows. The number of connections (43) was determined so that the estimated false discovery rate was 0.2. In this figure, some neuron pairs that are connected by red arrows (putative facilitatory connections) and located close to one another may be single neurons segmented into two parts (ROIs), such as a soma and an axon hillock. Even if such pairs are removed, the detected functional connectivity still tends to link neurons near each other. Five neuron pairs connected by blue arrows (putative inhibitory connections) were located distant from each other.

Statistical properties of the proposed empirical Bayesian testing for this dataset are shown in Fig. 8. The probability distribution of a test statistic is a mixture (a weighted sum) of unknown null and alternative distributions, which correspond to the left mass and the right tail of the density in panel (a), respectively. The surrogate distribution of the test statistic (green curve) fits the left mass of the empirical distribution (blue curve) well, indicating that the surrogate distribution simulated the unknown null distribution well. The local false discovery rate (panel b) shows that the estimated total proportion of null connections was \(\pi _0=0.83\) and that the local false discovery rate was smaller than 0.2 when the statistic was larger than 2.5. The significant pairs of neurons are shown as white dots in the source/destination matrix (panel c); their degrees of significance can be judged by comparison with the background noise level in the surrogate area of the matrix.
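The local false discovery rate in panel (b) can be approximated as the ratio of the surrogate (null) density to the empirical density of the test statistic. The histogram-based sketch below is a coarse simplification of the empirical Bayes procedure: the null proportion \(\pi_0\) is absorbed into a cap at 1 rather than estimated, and the toy Gaussian data are assumptions for illustration.

```python
import numpy as np

def local_fdr(stats, null_stats, bins=30):
    """Histogram-based local false discovery rate: f0(x) / f(x), capped
    at 1, where f0 is estimated from surrogate-pair statistics and f
    from all observed statistics."""
    edges = np.histogram_bin_edges(
        np.concatenate([stats, null_stats]), bins)
    f, _ = np.histogram(stats, edges, density=True)
    f0, _ = np.histogram(null_stats, edges, density=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        lfdr = np.where(f > 0, np.minimum(f0 / f, 1.0), 1.0)
    idx = np.clip(np.digitize(stats, edges) - 1, 0, bins - 1)
    return lfdr[idx]                  # lfdr evaluated at each statistic

rng = np.random.default_rng(4)
null_stats = rng.normal(0, 1, 5000)                   # surrogate pairs
stats = np.concatenate([rng.normal(0, 1, 900),        # mostly null links
                        rng.normal(4, 1, 100)])       # some true links
q = local_fdr(stats, null_stats)
print(q[stats > 4].mean(), q[np.abs(stats) < 1].mean())
```

Links with large statistics get a small local FDR because the surrogate density is negligible there, mirroring the panel (b) behavior (lfdr < 0.2 for statistics above 2.5).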

The spike response functions of the significant functional connectivity are shown in Fig. 9 in ascending order of q values. Many response functions of the most significant functional connectivity had a positive peak at lag-times of around 8 samples (\(\approx 80\) ms), and some response functions included a tail around the 50th sample (\(\approx 500\) ms), which is consistent with EPSPs possibly mediated by AMPA and NMDA receptors (see Discussion for details). Moreover, an obvious continuity between the response functions of pre-post connections, \(R_{ic}(s)\), and their reverses, \(R_{ci}(s)\), was found for many pairs of neurons, (*i*, *c*), in Fig. 10. This continuity (at \(s=0\)) partly reflects the correlation between neurons *i* and *c*. Such correlation might have been overestimated due to jitter in spike timing detection: in some cases neuron *i* emits a spike slightly earlier than neuron *c*, and in other cases neuron *c*’s spike is earlier, because of jitter in the binning.