A voiced speech signal such as a vowel is created in the human sound production system through phonation and articulation [1]. In normal phonation, the vibrating vocal folds produce a periodic excitation, termed the glottal flow. Due to this inherent periodicity, the spectra of vowels produced in normal phonation are characterized by a harmonic comb structure, i.e., energy concentrated at the fundamental frequency (F0, ranging from approximately 100 Hz in adult males up to 400 Hz in infants) and at its integer multiples (2 × F0, 3 × F0, etc.), spaced regularly in frequency [2]. This comb structure is then locally weighted in frequency by the resonances of the vocal tract. These resonances, termed formants (F1, F2, F3, etc.), determine the vowel category. Changing the shape and length of the vocal tract results in different formant frequency settings and, consequently, in variations of the perceived phoneme category. The F0 and its harmonics are the primary acoustic cues underlying pitch perception, and the two lowest formants are regarded as the major cues in vowel categorization [1].
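To make this source-filter account concrete, the following minimal Python sketch synthesizes a crude vowel-like waveform by driving a cascade of two-pole formant resonators with a periodic impulse train at F0. The F0, formant, and bandwidth values are illustrative assumptions (the /a/-like values quoted below), not the parameters of any particular study.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000            # sampling rate (Hz)
f0 = 120              # fundamental frequency (Hz), illustrative adult-male value
dur = 0.5             # duration (s)

# Periodic glottal source: an impulse train at F0 has a harmonic comb
# spectrum with energy at F0, 2*F0, 3*F0, ...
n = int(fs * dur)
source = np.zeros(n)
source[::int(fs / f0)] = 1.0

# Vocal-tract filter: a cascade of two-pole resonators at the formants.
def resonator(x, freq, bw, fs):
    r = np.exp(-np.pi * bw / fs)           # pole radius from bandwidth
    theta = 2 * np.pi * freq / fs          # pole angle from centre frequency
    return lfilter([1.0], [1.0, -2 * r * np.cos(theta), r ** 2], x)

vowel = source
for f, bw in [(700, 80), (1100, 90)]:      # approx. F1, F2 of /a/ (assumed)
    vowel = resonator(vowel, f, bw, fs)

vowel /= np.max(np.abs(vowel))             # normalize amplitude
```

Passing the same source through resonators at, say, 300 and 800 Hz would instead yield an /u/-like spectrum, illustrating how articulation reweights an otherwise unchanged harmonic comb.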
The auditory N1(m) response of electro- and magnetoencephalography (EEG & MEG, respectively), generated in the auditory cortices of the left and right hemisphere, reflects the acoustic properties of auditory stimuli [3–10; see 11 for a review]: its amplitude is largely determined by stimulus onset characteristics and stimulus intensity, and its latency varies according to both stimulus intensity and frequency. An increase in stimulus intensity decreases the latency of the N1m, and in the 500–4000 Hz range the N1m is elicited at a roughly invariant latency. Interestingly, in the frequency range of the speech F0, sinusoidal stimuli result in longer-latency N1(m) responses, and this latency delay increases monotonically as stimulus frequency is lowered [12, 13].
With respect to phonation, the latency delay of the N1m is observable both when the F0 is present [14] and when it is absent [11, 15, 16]; in the latter case, provided that the harmonic structure of the high-frequency components is intact, the result is the virtual perception of the fundamental frequency (i.e., the missing fundamental). With regard to articulation, the categorization of vowels might be based on temporal encoding of the formant frequencies [6, 7, 17, 18]. For instance, the vowel /u/, which has relatively low F1 and F2 values (approx. 300 & 800 Hz, respectively), elicits the N1(m) at a longer latency than the vowel /a/, which has higher F1 and F2 values (700 & 1100 Hz, respectively). Previous studies have related these effects either to the F1 values [11, 18] or to both the F1 and F2 values [6, 7, 17] of these vowels.
These latency effects of the N1m elicited by vowels have been documented to occur symmetrically in the two hemispheres [6, 7, 11, 17, 18]. This symmetry appears rather surprising given that speech stimuli comprising consonants [4, 19] have been found to elicit asymmetric N1m response behavior. However, given that vowels are the core phonemes of speech utterances [2], and that they comprise spectral energy preferred by either the left or the right hemisphere (i.e., formant frequencies and glottal periodicity, respectively; [20]), one would expect isolated vowel sounds to result in hemispheric asymmetries as indexed by the auditory N1m response. Hemispheric specificity of speech processing notwithstanding, no consensus has been reached on whether cerebral asymmetries are brought about only by attentional top-down modulation of cortical activity [21] or whether they might be found already in the passive recording condition, when the subject is not engaged in the attentive processing of vowel stimuli.
To summarize, the effects of voice excitation and articulation on cortical activity elicited by vowels have been studied extensively, but more often than not in isolation. This might be considered a shortcoming in cognitive brain research, all the more so because the two phenomena are inseparable in real speech communication. In addition, studies addressing the combined effects of phonation and articulation have typically characterized voice excitation from a much too narrow perspective: it is often quantified in terms of F0 alone, while the type of the excitation, and thereby also the set of underlying spectral cues, is ignored. This limited perspective, again, can be criticized from the point of view of natural speech communication. As an example, two representatives of the vowel /a/ can be created with equal F0s but with greatly different voice excitation waveforms. The result is two speech sounds, both perceived as the phoneme /a/ and, importantly, of the same pitch, whose voice quality nevertheless differs clearly owing to the different excitation waveforms. For example, one /a/ may sound breathy due to a soft pulse form in the glottal excitation, whereas the other /a/ may be perceived as pressed, resulting from a sharper shape of the glottal excitation pulse [22, 23].
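As an illustration of how pulse shape alone can alter voice quality at a fixed F0, the sketch below uses the classic Rosenberg glottal pulse model (a standard textbook model, assumed here for illustration rather than taken from [22, 23]) and compares a soft, gradually closing pulse with a sharper one. The sharper closure leaves relatively more energy in the higher harmonics, consistent with a more pressed quality.

```python
import numpy as np

fs = 16000
f0 = 120
T0 = int(fs / f0)                # samples per glottal cycle

def rosenberg_pulse(T0, open_q, speed_q):
    """One cycle of a Rosenberg glottal pulse.
    open_q:  open quotient, fraction of the cycle the glottis is open
    speed_q: ratio of opening- to closing-phase durations; a larger
             value yields a more abrupt ('sharper') closure."""
    Te = int(open_q * T0)                    # open-phase length
    Tp = int(Te * speed_q / (1 + speed_q))   # opening-phase length
    Tn = Te - Tp                             # closing-phase length
    g = np.zeros(T0)
    g[:Tp] = 0.5 * (1 - np.cos(np.pi * np.arange(Tp) / Tp))
    g[Tp:Te] = np.cos(np.pi * np.arange(Tn) / (2 * Tn))
    return g

# Same F0, different pulse shapes: a smooth, gradual closure versus a
# sharp one. The sharp closure boosts higher harmonics (less spectral
# tilt), heard as a more 'pressed' quality; the soft pulse tilts toward
# 'breathy' (real breathiness would also add aspiration noise, omitted here).
soft = np.tile(rosenberg_pulse(T0, open_q=0.7, speed_q=2.0), 60)
sharp = np.tile(rosenberg_pulse(T0, open_q=0.4, speed_q=6.0), 60)

for name, sig in [("soft", soft), ("sharp", sharp)]:
    spec = np.abs(np.fft.rfft(sig * np.hanning(len(sig))))
    freqs = np.fft.rfftfreq(len(sig), 1 / fs)
    hi = spec[freqs > 2000].sum() / spec.sum()
    print(f"{name} pulse: {hi:.1%} of spectral magnitude above 2 kHz")
```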
Besides the above-mentioned restricted view on the role of the voice excitation type, we hasten to emphasize another, equally overlooked issue in studies of speech production and perception: because of the wide range of their F1 and F2 values, vowels also differ fundamentally in how their energy is distributed over frequency. For instance, due to its high F1 and F2, the sound energy of the vowel /a/ is distributed across a wide, 0–2 kHz range of high-energy harmonics. In the vowel /u/, by contrast, the low positions of F1 and F2 strongly attenuate the higher harmonics, and most of the sound energy is concentrated below 1 kHz. This, in turn, results in variations in the perceived loudness of the stimuli, despite attempts to adjust stimulus intensity using objective measures such as the sound pressure level (SPL).
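A minimal sketch of this point, assuming the formant values quoted above and the same impulse-train synthesis idea as earlier, computes the fraction of spectral power falling below 1 kHz for /a/-like and /u/-like formant settings:

```python
import numpy as np
from scipy.signal import lfilter

fs, f0, n = 16000, 120, 8000
source = np.zeros(n)
source[::int(fs / f0)] = 1.0                 # harmonic comb source

def formant_filter(x, formants, fs, bw=90):
    # cascade of two-pole resonators, one per formant
    for f in formants:
        r = np.exp(-np.pi * bw / fs)
        th = 2 * np.pi * f / fs
        x = lfilter([1.0], [1.0, -2 * r * np.cos(th), r ** 2], x)
    return x

# Assumed F1/F2 values from the text: /a/ ~ (700, 1100) Hz, /u/ ~ (300, 800) Hz
for name, formants in [("/a/", (700, 1100)), ("/u/", (300, 800))]:
    v = formant_filter(source, formants, fs)
    p = np.abs(np.fft.rfft(v)) ** 2          # power spectrum
    freqs = np.fft.rfftfreq(n, 1 / fs)
    low = p[freqs < 1000].sum() / p.sum()
    print(f"{name}: {low:.0%} of spectral power below 1 kHz")
```

The /u/-like setting concentrates markedly more of its power below 1 kHz, which is why equal-SPL tokens of /a/ and /u/ can still differ in perceived loudness.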
Recent studies conducted in the passive recording condition indicate that the overall harmonic structure of vowels should perhaps not be overlooked in descriptions of speech-evoked cortical activity. First, the amplitude of the N1m is modulated by the mere presence of periodic glottal excitation in vowel sounds: a vowel with this kind of excitation elicits larger-amplitude N1m responses than the same vowel with an aperiodic, intensity-matched noise excitation [24]. Further, the amplitude of the N1m reflects temporal changes in the harmonic structure of speech created by glides in F0, while corresponding glides in pure tones do not affect the N1m amplitude [25]. In contrast to these observations, both the amplitude and latency of the N1m are unaffected by the identity of loudness-matched vowels (/a/, /o/, & /u/) [26] and by the lack of phonetic F1,F2-content in natural, periodically excited vowels [27]. Regardless of the formant frequencies, the latency of the N1m elicited by speech sounds with different F0 values appears to be invariant and shorter than the latency of the N1m elicited by pure tones whose frequencies are adjusted to match the F0 of the speech sounds [25, 27]. Thus, these findings tentatively suggest that the presence of periodic glottal excitation in auditory stimulation might be an important prerequisite for the elicitation of speech-specific cortical activity.
Given the lack of data on the combined effects of phonation and articulation, the present study was designed to investigate how different combinations of voice excitation (phonation) and formant frequencies (articulation; for a description of the stimuli, see Fig. 1) are reflected in the cortical processing of vowels as indexed by the auditory N1m response. To investigate the effects of phonation, we used the periodic glottal excitation extracted from a natural utterance and contrasted its effects with those of an aperiodic noise waveform and a tonal excitation represented by two sinusoids. The effects of articulation, in turn, were analyzed by introducing two natural-sounding vowels with an intact harmonic structure (/a/per & /u/per) located in opposite corners of the F1,F2-space. Hence, as illustrated in Fig. 1, the study comprised two phonemes with known formant values, each created by three substantially different variants of excitation. The spectra of the vowels excited by aperiodic noise (/a/aper & /u/aper) were similar to those of their periodic counterparts, both in terms of the formant frequencies and the overall spectral envelope but, importantly, they lacked the comb structure of natural speech. Further impoverishing the stimulation, we also utilized two-tone complexes /a/tone and /u/tone, in which the sound energy was concentrated at two distinct frequency peaks corresponding to the F1 and F2 of /a/ and /u/.
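The following sketch illustrates the logic of this six-stimulus design. It is only a schematic stand-in: the study's periodic excitation was extracted from a natural utterance, whereas here a synthetic impulse train is used, and the formant values, bandwidths, and RMS-based level matching are all assumptions made for illustration.

```python
import numpy as np
from scipy.signal import lfilter

fs, f0, dur = 16000, 120, 0.4
n = int(fs * dur)
rng = np.random.default_rng(0)

def formant_filter(x, formants, fs, bw=90):
    """Cascade two-pole resonators at the given formant frequencies."""
    for f in formants:
        r = np.exp(-np.pi * bw / fs)
        th = 2 * np.pi * f / fs
        x = lfilter([1.0], [1.0, -2 * r * np.cos(th), r ** 2], x)
    return x

# Three excitation variants: periodic (comb spectrum), aperiodic noise
# (no comb structure), and tonal (handled per vowel below).
periodic = np.zeros(n)
periodic[::int(fs / f0)] = 1.0
aperiodic = rng.standard_normal(n)
t = np.arange(n) / fs

stimuli = {}
for vowel, (F1, F2) in {"a": (700, 1100), "u": (300, 800)}.items():
    stimuli[f"/{vowel}/per"] = formant_filter(periodic, (F1, F2), fs)
    stimuli[f"/{vowel}/aper"] = formant_filter(aperiodic, (F1, F2), fs)
    # Tonal variant: all energy at two sinusoids located at F1 and F2.
    stimuli[f"/{vowel}/tone"] = np.sin(2 * np.pi * F1 * t) + np.sin(2 * np.pi * F2 * t)

# Equalize RMS level across the six stimuli; in the laboratory, intensity
# matching would additionally involve SPL calibration, not modelled here.
for k, v in stimuli.items():
    stimuli[k] = v / np.sqrt(np.mean(v ** 2))
```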
Perceptually, the vowels /a/per and /u/per were of normal voice quality, while their intensity-matched, aperiodic noise-excited counterparts resembled whispered speech. Both had a rich spectral structure and were recognizable as speech. In contrast, the tonal stimuli had an extremely sparse spectral structure and were not perceivable as speech. Based on previous research [11, 12, 14–16, 24–27], we hypothesized that the type of phonation (voice excitation) should be reflected in latency variations of the N1m response. With regard to articulation, we expected that the different sound energy distributions of the vowels /a/ and /u/, caused by the different articulatory settings explained above, should result in variations in the amplitude of the N1m. With regard to the amplitude, latency, and source localization of the N1m, we were specifically interested in whether asymmetries between left- and right-hemispheric brain activity might arise already in the passive recording condition. Finally, in line with the tentative findings reported in [24], the experimental design allowed us to study whether human speech with an intact, natural harmonic structure leads to a different spatial distribution of cortical activation than unnatural utterances.