Example data: neurophysiological correlates of phonemic identification
We validated the RoLDSIS technique using data from an experiment that addressed categorical perception. Perceptual categorization involves neuronal mechanisms accounting for the transformation of lower-level sensory inputs, which capture the continuous properties of the stimulus, into a higher-level conceptual representation, which is composed of discrete classes or categories. This is the case, for example, of the perception of colors and facial emotions [12]. Categorical perception also happens in speech, where sounds with continuous physical attributes are mapped onto discrete perceptual classes, in a process called phonemic categorization [13].
Over the past decades, categorical perception in speech has been studied from both the behavioral and theoretical points of view [14]. More recently, improvements in technologies for brain activity measurement, as well as the availability of computational power, have allowed the investigation of the neurophysiological mechanisms underlying this phenomenon. Using direct electrode recordings of patients undergoing preoperative surgery, Chang and colleagues [15] recorded the cortical responses to the phonemic continuum /ba/-/da/-/ga/ in the secondary auditory cortex. By applying pattern recognition techniques, they showed that neuronal activity mirrors perception, demonstrating that categorical representation arises around 110 ms after stimulus onset. Using a less invasive acquisition system (EEG), Bidelman and colleagues [16] investigated the emergence of categorical perception for the phonemic continuum /u/-/a/. By analyzing the frequency band involved in the N1–P2 complex at the temporal scalp region, they showed that the neurophysiological correlates of phonemic categorization emerges around 175 ms after stimulus onset. They also showed that physical properties of the stimulus (vowel formants, in their case) are encoded in the early, high-frequency bands of the auditory response, probably coming from sub-cortical regions. Using fMRI and MEG during syllable identification, Bouton and colleagues [17] showed that activities related to sensory and categorization processing happens in a restricted part of the posterior superior temporal gyrus. They also showed that neuronal activity in this region reflect the syllable identification errors.
These studies indicate that it is possible to investigate the neurophysiological correlates of phonemic categorization. However, none of them tried to infer the neuronal representation of both the physical (\(\phi\)) and psychophysical (\(\psi\)) attributes of the stimuli directly from the full set of features available in the evoked auditory responses. This is a situation that is particularly well suited for the application of the RoLDSIS procedure. In the subsections that follow, we describe a phonemic identification experiment using EEG, whose data will be used to validate the new regression technique proposed in the present paper.
Participants
Eleven participants, five males and six females, aged 28 years on average (SD 9 years), participated voluntarily in the experiment. All participants were native speakers of Brazilian Portuguese. They were all right-handed and presented an average grade of 76.8 for the right hand, according to Oldfield’s laterality index [18]. None of them had any history of neurological, language, or auditory disorders. All participants were previously informed about the procedures and tasks of the experiment and provided written informed consent to participate in the study. The experiment was approved by the Ethics Committee of the Federal University of Minas Gerais (COEP-UFMG, Brazil).
Auditory stimuli
A continuum of sounds was created between the syllables /da/ and /ta/, which were recorded from a male speaker of Brazilian Portuguese, who uttered both syllables in isolation in a natural setting inside a sound-proof booth. Both syllables finish with the open front unrounded vowel /a/. The initial consonants are the alveolar stops /t/ and /d/. These consonants have the same point of articulation at the front of the vocal tract, but differ in voice onset time (VOT), which is the amount of time between the occlusion release and the voicing arising from the vibration of the vocal folds (negative for /da/ and positive for /ta/) [19]. A morphing procedure was used to generate the 200 intermediary, synthetic stimuli of the continuum. These stimuli were created by continuously varying the onset of the voicing murmur of /da/, from −52 ms to 0 ms. In all cases, the release burst is present, resulting in a \(+\)16 ms VOT value for the extreme /ta/ syllable. Each stimulus was saved to an audio WAV file. The reference time was chosen to be the beginning of the stationary part of the vowel, such that the beginning of the WAV file corresponds to \(t=-\)74 ms in the original stimuli. Thus, the stationary part of the vowel for all stimuli was temporally aligned, in relation to the beginning of the WAV file. The duration of the WAV files is 220 ms.
Identification task
The participants were tested in a preliminary phonemic identification experiment, in which the stimuli were presented in random order through earphones. The participant’s task was to identify the perceived syllable (/da/ or /ta/) in a forced binary choice task. The results of this experiment are shown in are Fig. 2 for a representative participant, where a psychometric, logistic curve was fitted to the participant’s responses using the glmrob function of the R software [20]. In this figure, the values 0% and 100% correspond, respectively, to the identification of the /da/ and /ta/ syllables. For the subsequent EEG experiment, the stimuli corresponding to 0%, 5%, 50%, 95%, and 100% of the psychometric curve were selected. Hereafter, these stimuli are called #1, #2, #3, #4, and #5, respectively. Stimuli #2 and #4 are closer to stimulus #3 from the acoustic (physical) point of view (abscissa axis), and closer to the extreme stimuli #1 and #5, respectively, from the perceptual (psychophysical) point of view (ordinate axis).
EEG experiment
Each participant was subsequently tested in an EEG experiment, where the five selected stimuli were presented in random order, 200 times each. The participant was asked to perform the same phonemic identification task as done previously. We recorded the activity of the electrodes placed at the vertex of the head and on the mastoid bone, behind the left ear. These are the placement of electrodes typically used in the study of speech processing in the central nervous system, including the auditory brainstem response (ABR) and the responses in the auditory cortex (temporal lobe) [21]. This choice of placement of the electrodes should then allow the recording of the underlying neuronal activity used both in phonemic feature processing and categorization [15, 22]. The electric potentials between these two electrodes were acquired with passive gold cup electrodes connected to an RHD2000 acquisition board (Intan Technologies) and sampled at 5 kHz. The signal was epoched by stimulus response and time-locked to the onset of signal in the WAV file. The response in each trial was baseline corrected using a 150 ms pre-stimulus time. Excessively noisy trials were removed by visual inspection, what resulted in around 91.5% of the trials being kept for subsequent analysis. Each trial has 2048 time samples (lasting for around 0.4 s). Figure 3 illustrates the ERPs for each stimulus for a representative participant. The SNR of the raw signal was estimated from the ERP to be between − 12 and − 15 dB.
Feature extraction
Discrete wavelet transform (DWT) was applied to each trial, in order to obtain its representation in the time-frequency domain [23]. The DWT is obtained by filtering the signal through a series of high and low-pass filters in a recursive filtering and downsampling process. Since the DWT is a linear and orthogonal transformation, it provides a parsimonious representation of the signal being transformed, with each resulting coefficient associated with a specific location in the time-frequency domain [24]. Furthermore, the DWT performs the decomposition of the signal into frequency bands, what is useful for the analysis of rhythmic patterns of neuronal activity in the central nervous system. In our case, the DWT yields a set of 2048 coefficients, organized in blocks. The first block contains the so-called approximation coefficients (V) and comprise a low-pass filtered representation of the signal. The remaining blocks contain the detail coefficients (W), which comprises the high frequency information. These coefficients are obtained by convolving the signal with a band-pass filter based on the mother wavelet [25]. We used as mother wavelet the Daubechies orthonormal compactly supported wavelet of length 8, from the least asymmetric family, available in the package wavelets of the R software [26]. Only the DWT coefficients corresponding to the low frequency bands (between 0 and 156 Hz) were retained, resulting in a feature vector of length 128. This range of frequencies covers the bands \(\uptheta\), \(\upalpha\), \(\upbeta\), and \(\upgamma\), which are of interest in brain electrophysiology studies of speech perception [27, 28]. The feature vectors were averaged across trials for each participant and each stimulus. The RoLDSIS technique was then applied to these averaged observations \({\mathbf {x}}_i, i = 1,\ldots ,5\).
Application of the RoLDSIS technique
Linear regression on physical and psychophysical attributes
As explained above, each stimulus i is associated with both a specific physical attribute \(\phi _i\) (VOT value of the associated stimulus) and with a specific psychophysical attribute \(\psi _i\) (the proportion of /ta/ responses of the associated stimulus, obtained from the psychometric curve). For the psychophysical attribute, we used the proportions of /ta/ responses corresponding to the selected stimuli \(\psi _1 = 0.0\), \(\psi _2 = 0.05\), \(\psi _3 = 0.5\), \(\psi _4 = 0.95\), and \(\psi _5 = 1.0\). For the physical attributes, we used the VOT of the selected stimuli. The first (\(\phi _1\)) and last (\(\phi _5\)) values were equal for all participants and corresponded to the /da/ and /ta/ stimuli at the beginning and at the end of the continuum (−52 ms and \(+\)16 ms, respectively). The other three values varied for each participant, since the psychometric curve is idiosyncratic. For instance, for the participant whose psychometric curve is depicted in Fig. 2, the physical attributes were \(\phi _2 = -35\) ms, \(\phi _3 = -22\) ms, and \(\phi _4 = -8\) ms.
We hypothesize the following linear relationships, for \(i = 1,\ldots ,5\):
$$\begin{aligned} \phi _i&= a_\Phi + {\mathbf {b}}_\Phi ^\intercal {\mathbf {x}}_i, \end{aligned}$$
(17)
$$\begin{aligned} \psi _i&= a_\Psi + {\mathbf {b}}_\Psi ^\intercal {\mathbf {x}}_i, \end{aligned}$$
(18)
where \({\mathbf {x}}_i, \; i = 1,\ldots ,5\), are the observation vectors obtained from the DWT analysis described above. We assume that \({\mathbf {b}}_\Phi\) and \({\mathbf {b}}_\Psi\) are unit vectors (ie \(||{\mathbf {b}}|| = 1\)). Since \({\mathbf {x}}_i \in {\mathbb {R}}^{128}\), vectors \({\mathbf {b}}_\Phi\) and \({\mathbf {b}}_\Psi\) have 127 free coefficients to be determined. Considering also the scalar parameters \(a_\Phi\) and \(a_\Psi\), each equation above results in a system of 5 linear equations with 128 unknowns.
The solution can be found using the RoLDSIS technique. The 128 coefficients of vectors \({\mathbf {b}}_\Phi\) and \({\mathbf {b}}_\Psi\), called RoLDSIS loadings, can be represented in the form of a scalogram, which is a time-frequency representation, similar to the one used in [24]. This is depicted in Fig. 4. Notice that the representation space for \({\mathbf {b}}\) is identical to that for \({\mathbf {x}}\), i.e. the \({\mathbb {R}}^N\) feature space. In the scalograms, the magnitude of each coefficient is encoded by the color saturation, such that the paler the color, the closer the coefficient is to zero. The sign of the coefficient is encoded by the color, red and blue meaning negative and positive values, respectively. The vectors \({\mathbf {b}}\) can then be transformed into the time domain using THE inverse DWT. The associated time profiles for \({\mathbf {b}}_\Phi\) and \({\mathbf {b}}_\Psi\) are shown on the top of the respective scalograms in Fig. 4.
Projections onto physical and psychophysical directions
The vectors \({\mathbf {b}}_\Phi\) and \({\mathbf {b}}_\Psi\) obtained by the RoLDSIS procedure can be interpreted as specific directions in the feature space. These directions would then indicate a sort of “canonical” representation of the neuronal activity that is associated with the variation in the stimulus attribute, either physical or psychophysical. The varying response along these directions can be represented in the time domain as in Fig. 5. Each curve in the figure is obtained by projecting the original observations \({\mathbf {x}}_i\) onto the respective direction and by using the inverse DWT to obtain the associated time profile.
Relationship between \({\mathbf {b}}_\Phi\)–\({\mathbf {b}}_\Psi\) divergence and the degree of categorization
In order to assess the relevance of the results obtained by the RoLDSIS technique, we computed the angle between the obtained physical and psychophysical directions. The value of this angle is specific to each participant and represents the separation between the neuronal representations of the two attributes. This angle can vary between 0\(^\circ\) and 90\(^\circ\), where a 0\(^\circ\) angle means the physical and psychophysical representations are indistinguishable from each other whereas a 90\(^\circ\) angle means they are uncorrelated (orthogonal) to each other. We investigated the relationship between this angle and the degree of categorization, which corresponds to the maximal slope of the psychometric curve fitted to the participant’s responses in the identification task (see Fig. 2).
The psychometric curve is described by the sigmoid function
$$\begin{aligned} p(t)=100/[1 + e^{\beta (t-t_0)}], \end{aligned}$$
(19)
where t is the VOT, p(t) is the probability of choosing /ta/ when VOT \({}=t\), and \(t_0\) corresponds to the value of t at the curve’s inflection point (\(p=50\%\)). The maximal slope of the psychometric curve happens at \(t=t_0\) and is equal to \((100/4)\beta\) (in %/ms units). A large value of \(\beta\) indicates a stronger categorical perception by the participant [29]. The results for the 11 participants are shown in Fig. 6. As it can be seen in the figure, the angle is significantly correlated with the slope in the population (Pearson’s \(r = 0.67\), \(t[9] = 2.68\), \(p < 0.05\)).
Assessment of the RoLDSIS technique
What if the regression problem were overdetermined?
In our experiment, we computed the average of the ERPs for each stimulus, in order to reduce the SNR of the obtained signals. This results in an HDLSS problem, which usually requires the use of regularization to solve the regression problems defined by Eqs. 17 and 18 . The HDLSS problem can be potentially alleviated by having more data observations available. This can be done artificially, without actually collecting more data, by splitting the currently available data into groups containing a smaller number of trials. If the number of observations is larger than the number of features, the equation systems become overdetermined and classical least-squares linear regression can be used. We assessed this possibility by solving the regression with different numbers of trials per observation. We did it for all participants. For a number of trials per observation greater than one, the trials were assigned at random to each observation. The results for the root mean square (RMS) regression errors for the \({\mathbf {b}}_\Phi\) and the \({\mathbf {b}}_\Psi\) axes across the population are summarized in Fig. 7. The y attributes of our stimuli vary between − 52 and + 16 ms, in the \(\phi\) case, and between 0 and 1, in the \(\psi\) case. From the figure, one can see that the RMS regression errors are considerably high when there is one trial per observation, with the population mean being 19.1 ms in the \(\phi\) case and 0.35 in the \(\psi\) case. When the number of trials per observation increases, the RMS decreases almost linearly towards zero, a value that is theoretically attained when the number of observations is less than 128.
Reliability of the neurophysiological axes computation
The separation between the \({\mathbf {b}}_\Phi\) and \({\mathbf {b}}_\Psi\) axes, expressed as the angle between these two directions (see Fig. 6) could be simply the result of a statistical fluke. In order to assess this issue, we ran a bootstrap procedure in which, for each participant and for each stimulus, the trials were resampled with replacement. One hundred new estimations for each \({\mathbf {b}}_\Phi\) axis and \({\mathbf {b}}_\Psi\) axis were thus obtained using RoLDSIS on each resampled data set. The values obtained in this procedure represent directions in the \({\mathbb {R}}^{128}\) space of wavelet features. The axes are unit vectors, lying on a 127-dimensional hypersphere, and can thus be represented by 127 spherical coordinates (the analogous of azimuth and elevation in a 3D sphere) [30]. In order to assess the results, PCA was applied to the set of 200 points (including both physical and psychophysical cases) transformed into spherical coordinates, using the prcomp function of the R software. This procedure was applied separately to each participant. The two first principal components (PCs) explain, on average, 50% of the variance, with a minimum of 32% and a maximum of 89% in the population. Linear discriminant analysis (LDA) was then applied to the data projected onto the first two PCs, using the lda function of MASS package of the R software [31]. LDA works by finding the linear transformation that maximizes the ratio between the inter-class and the intra-class variances. The resulting LDA separatrix defines the linear decision boundary that optimally separates the \({\mathbf {b}}_\Phi\) and the \({\mathbf {b}}_\Psi\) points. The reliability of the RoLDSIS procedure is assessed by the amount of LDA misclassifications, whose median value across participants is 7 (minimum value 0, maximum value 63). Figure 8 shows the results of this PCA–LDA procedure for a representative participant.
Comparison with regularized linear regression procedures
A legitimate question that may arise at this point is how the RoLDSIS procedure compares with other regression techniques commonly used in HDLSS problems. In order to make this comparison, we considered three popular regression techniques, namely LASSO [32], Ridge Regression [9] and SPLS [10]. These techniques have regularization parameters (\(\lambda\) for LASSO and Ridge Regression, and \(\zeta\) and K for SPLS) whose optimal values can be found by using a CV procedure. This procedure runs as follows. First, the trials for each stimulus are randomly split into k groups, called folds. Second, we take each group in turn, put it aside as the test data set, and use the data in the remaining \(k-1\) groups to fit the model. The fitted model is used to compute the prediction errors on the test data set. The total CV error is the mean value of the prediction errors computed across the k steps. The model fitting is done for specific values of the regularization parameters, according to the particular regression technique being tested. Using a gradient-descent optimization procedure, we found the optimal values of the regularization parameters that yield the minimum value of the CV error. Since RoLDSIS has no regularization parameter, the optimization procedure described above does not apply to it. This procedure was applied to each of the eleven participants and to each of the physical and psychophysical attributes. Figure 9 shows the population mean of the CV errors for the number of folds varying from 3 to 6, as well as the 95% confidence intervals of the mean estimations.
In order to assess how differently the regression techniques perform on our data, we fitted a linear mixed model to the results, considering the number of folds as a continuous fixed factor, the regression technique as a fixed discrete factor, and the participant as a random factor. The mean squared error (MSE) values, which follow a \(\chi ^2\) distribution, were transformed to normal [33] and the resulting values were used as the dependent variable of the linear model. The results show a significant increase in MSE with the number of folds (\(F[1,158]=50.4, p < 0.001\) for physical and \(F[1,158]=32.2, p < 0.001\) for psychophysical). For the physical case, there was a significant effect for the method factor (\(F[3,158]=5.22, p < 0.01\)), and multiple comparisons showed significant differences among all pairs of methods, besides the RoLDSIS / Ridge Regression pair. For the physical case, the method factor has a marginal effect (\(F[3,158]=2.47, p < 0.064\)). In this later case, no significant differences were found between RoLDSIS, LASSO, and Ridge Regression, but SPLS was significantly different from the others.
Time-frequency representation of the neurophysiological axes
As illustrated in Fig. 4, the RoLDSIS results can be useful for revealing the locations, in the time-frequency domain, associated with the stimulus attributes (physical and psychophysical in the present paper). Since the regression is obtained on an individual basis, the patterns of time-frequency distribution associated with the neurophysiological axis may differ from one participant to another. Therefore, it would be interesting to know whether there are global time-frequency patterns that arise in the population.
This investigation involved RoLDSIS, as well as the three other regression techniques considered in the previous section, and consisted in the computation of the population-wide histogram of the neurophysiological axis in the time-frequency domain. The first step of this analysis is to compute the squared values of the components of the axis \({\mathbf {b}}\) obtained by the regression technique. The squared value of a given wavelet coefficient can be interpreted as the importance (or the “energy”) of the neurophysiological axis at the associated time-frequency slot. The resulting values were then accumulated for all participants, separately for the \({\mathbf {b}}_\Phi\) and the \({\mathbf {b}}_\Psi\) axes, and the square root was computed for the sums at each wavelet component. For the RoLDSIS technique, the average of the ERPs for each stimulus were used for doing the regression, whereas, for the other techniques, the regression result of the 3-fold CV were used (see previous section).
The results are shown in Fig. 10 in the form of time-frequency scalograms. The darker a DWT component appears in a scalogram, the more it contributes to the associated neurophysiological direction across the population.