Speech target modulates speaking induced suppression in auditory cortex
© Ventura et al. 2009
Received: 12 January 2009
Accepted: 13 June 2009
Published: 13 June 2009
Previous magnetoencephalography (MEG) studies have demonstrated speaking-induced suppression (SIS) in the auditory cortex during vocalization tasks wherein the M100 response to a subject's own speaking is reduced compared to the response when they hear playback of their speech.
The present MEG study investigated the effects of utterance rapidity and complexity on SIS: The greatest difference between speak and listen M100 amplitudes (i.e., most SIS) was found in the simple speech task. As the utterances became more rapid and complex, SIS was significantly reduced (p = 0.0003).
These findings are highly consistent with our model of how auditory feedback is processed during speaking, where incoming feedback is compared with an efference-copy derived prediction of expected feedback. Thus, the results provide further insights about how speech motor output is controlled, as well as the computational role of auditory cortex in transforming auditory feedback.
The role of auditory feedback in speech production is a topic of longstanding interest that has been investigated via a number of methods, most recently in studies using functional neuroimaging methods. Previous studies using magnetoencephalography (MEG) have revealed a phenomenon called speaking-induced suppression (SIS): a reduced response in auditory cortex to self-produced speech, compared with its response to externally-produced speech. These studies examined the M100 response, also called the N100m response, which is the most prominent peak in the magnetic response of cortex occurring approximately 100 ms after the onset of an auditory stimulus, and found a dampened auditory M100 response to a person's own voice when speaking compared to conditions in which a person listens to recorded speech being played back to them [2–4]. Researchers have also found that when self-generated voice sounds were different from the expected sounds, auditory cortex response was maximal, but if the output during speech production matched the expected sound, cortical activity in the auditory cortex was suppressed. Heinks-Maldonado, Nagarajan and Houde proposed a precise forward model for speech production. They suggested that a forward model operates in the auditory system during speech production, which causes maximal suppression of the auditory cortical response to the incoming sounds that most closely match the speech sounds predicted by the model. Researchers have argued that precise auditory suppression during speech allows the auditory system to distinguish between internally and externally produced speech sounds [7, 8].
If Houde and Nagarajan's model of auditory feedback processing is correct, one would expect utterance rapidity and complexity to affect SIS because temporal misalignment between actual feedback and the prediction only affects the prediction error for dynamic articulations. Thus, the goal of this paper is to test these model predictions about differences in SIS for static versus rapid, dynamic speech targets. We did this by comparing auditory processing during speech production with auditory processing during passive listening across three conditions with different speech targets (/a/, /a-a-a/, /a - a-a - a/). If the model is correct, we would expect a maximal difference in magnitude of the M100 response between speak and listen tasks in the simplest condition (/a/). However, with increasing rate and complexity of utterances (/a-a-a/, /a - a-a - a/), the speaking-induced suppression should be reduced, and the difference between speak and listen M100 amplitudes should be smaller in the complex utterances.
An analysis of the sensor Root Mean Square (RMS) M100 amplitude data during speech tasks revealed significant hemisphere and task differences. A repeated measures ANOVA with condition, task and hemisphere as factors revealed significant differences for hemisphere, p = 0.0336, and task, p = 0.000035. In contrast, M100 latency data revealed no significant differences.
Source space analysis with virtual sensors was used to analyze the M100 response arising from auditory cortex in each hemisphere in a 3 × 2 × 2 repeated measures ANOVA. The virtual sensor M100 amplitude data revealed significant differences for hemisphere, p = 4.42e-08, and task, p = 0.0001, and a trend towards significance for the interaction between condition and task, p = 0.0504. To further investigate this interaction between task (speak or listen) and condition (static versus dynamic speech targets), we compared virtual sensor M100 amplitude data from the simple speech target (condition 1) versus dynamic speech targets (conditions 2 and 3 combined). This ANOVA revealed significant differences for hemisphere, p = 0.000003, and task, p = 0.0005, and a significant interaction between task and condition, p = 0.0353. Interestingly, the interaction between hemisphere and task did not reach significance, p = 0.0878, suggesting that task effects are similar across the two hemispheres.
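As a rough illustration of the task × condition interaction described above, the sketch below computes a per-subject "difference of differences" and a paired t statistic on it. In a balanced, fully within-subject 2 × 2 design the interaction F statistic equals the square of this t, so the sketch is a simplified stand-in for the reported ANOVA, not a reproduction of it. The function name and the amplitude values are hypothetical and are not taken from the study.

```python
import numpy as np

def interaction_t(listen_static, speak_static, listen_dynamic, speak_dynamic):
    """Paired t statistic for the task x condition interaction.

    Each argument holds one M100 amplitude per subject (same subject order).
    Returns (t, degrees_of_freedom).
    """
    listen_static = np.asarray(listen_static, dtype=float)
    speak_static = np.asarray(speak_static, dtype=float)
    listen_dynamic = np.asarray(listen_dynamic, dtype=float)
    speak_dynamic = np.asarray(speak_dynamic, dtype=float)

    # Suppression (listen minus speak) in each condition, per subject
    sis_static = listen_static - speak_static
    sis_dynamic = listen_dynamic - speak_dynamic

    # Interaction contrast: how much larger suppression is for the static target
    contrast = sis_static - sis_dynamic
    n = contrast.size
    t = contrast.mean() / (contrast.std(ddof=1) / np.sqrt(n))
    return t, n - 1

# Made-up amplitudes for three hypothetical subjects (arbitrary units)
t_stat, df = interaction_t(
    listen_static=[10.0, 10.0, 10.0],
    speak_static=[5.0, 5.0, 5.0],
    listen_dynamic=[10.0, 10.0, 10.0],
    speak_dynamic=[6.0, 7.0, 8.0],
)
print(f"t({df}) = {t_stat:.3f}, equivalent F(1, {df}) = {t_stat ** 2:.3f}")
```

A positive t here means suppression was larger for the static target than for the dynamic targets, which is the direction of the effect reported above.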
No differences were observed in sensor RMS or virtual sensor M100 response latencies in the speech tasks.
Rapidity and complexity of the uttered syllable appear to modulate SIS of the M100 amplitude. SIS percent differences were largest with simple, static utterances in condition 1, smaller with rapid utterances in condition 2, and smallest with complex utterances in condition 3. Thus, the greatest difference between speak and listen M100 amplitudes was found in the static speech target (/a/), compared to the dynamic utterances (/a-a-a/ and /a - a-a - a/). These findings are consistent with predictions from our model of speech feedback processing. The greatest speaking-induced suppression was observed in condition 1 with the simple utterance, presumably because the internal representation, or mental model, for that utterance was largely static and therefore easy to produce and match. However, with increasing rate and complexity of utterances (conditions 2 & 3), the auditory feedback predictions became more dynamic and more difficult to keep in temporal register with the incoming auditory feedback, resulting in a poorer match with it, and, thus, a less suppressed response.
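The text does not spell out how the SIS percent difference was computed. A minimal sketch, assuming it is the listen-minus-speak amplitude difference expressed as a percentage of the listen amplitude, is given below. The formula, the function name and the example amplitudes are assumptions made for illustration only.

```python
def sis_percent(listen_amp, speak_amp):
    """Speaking-induced suppression as a percentage of the listen response.

    Assumed definition: 100 * (listen - speak) / listen.
    Positive values mean the speak response is smaller than the listen response.
    """
    if listen_amp <= 0:
        raise ValueError("listen amplitude must be positive")
    return 100.0 * (listen_amp - speak_amp) / listen_amp

# Made-up M100 amplitudes (arbitrary units), not values from the study
example_sis = sis_percent(listen_amp=40.0, speak_amp=30.0)
print(f"SIS = {example_sis:.1f}%")
```

Normalizing by the listen amplitude makes the measure comparable across conditions whose stimuli differ in loudness, which matters here because the three speech targets were not matched for volume.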
The differences in amplitude results across conditions are also in accord with Houde and Nagarajan's (2007) model of speech feedback processing: one's expectation for a speech sound (including volume) is related to the activity observed in the auditory cortex. If a participant spoke /a/ loudly, that participant could predict the sound of that utterance and the auditory cortex would not be "surprised" by the volume of the utterance. In such a scenario, one would expect to observe attenuated activity in the auditory cortex. If a participant spoke /a-a-a/ at a reduced volume, that participant could still predict the sound of that utterance and one would again expect to observe attenuated activity in the auditory cortex. However, during the listen task, participants could not predict the sound or volume of the auditory stimuli. Therefore, the auditory cortex behaved correspondingly: larger amplitudes were observed with louder stimuli, and smaller amplitudes were observed with quieter stimuli. This is directly related to the prediction/expectancy aspect of the model proposed by Houde and Nagarajan: if one's internal representation for a speech sound (including volume) matches the actual speech sound, then suppression or attenuation of cortical activity in the auditory cortex is observed. This process of matching one's internal representation of a speech sound to the actual speech sound is only possible in the speak task. On the other hand, all stimuli are unexpected during the listen task, and thus the response in the auditory cortex should depend solely on the properties of the auditory stimuli (i.e., larger M100 amplitudes with louder stimuli, etc.).
Given that different auditory stimuli were used in this study, the hemispheric differences in observed responses are noteworthy. Several studies have reported no hemispheric differences when tones were used [13, 14]. This is consistent with our findings: significant activations were observed in primary auditory cortex (mainly, in Heschl's gyri) in both hemispheres when pure tones were used. In contrast, previous literature suggests that the left hemisphere is dominant during speech and language perception [2, 4]. Therefore, when speech stimuli were used, a more dominant response was expected in the left hemisphere; in other words, a dampened response was expected in the right hemisphere. During both the speak and listen tasks, we observed this overall dampened M100 amplitude response in the right hemisphere. However, the effect of condition on SIS was the same across both hemispheres. Thus, in spite of the overall hemispheric differences in response to speech, the processing of auditory feedback during speaking may be similar across the two hemispheres.
These findings provide additional support for our conceptual model of speech motor control, and as such provide the impetus to test other predictions from the model. In addition, these findings also provide better insights into the speech motor control system, and the computational role of auditory cortex in transforming auditory feedback. The SIS paradigm used in this study may benefit the study of disorders such as schizophrenia, in which patients lack the ability to distinguish between internally and externally produced speech sounds. It may also benefit the study of speech production impediments such as stuttering [16, 17], where altered auditory feedback has been shown to be fluency-enhancing.
Ten healthy right-handed English speaking volunteers (6 males, 4 females; mean age 25 years; range: 21–42) participated in this study. All participants gave their informed consent after procedures had been fully explained. The study was performed with the approval of the University of California, San Francisco Committee for Human Research.
Calibration of the acoustic stimuli was conducted prior to starting the experiment to ensure that the volume through the earphones was equivalent in both speak and listen tasks. Each MEG session began and ended by recording Auditory Evoked Field (AEF) responses, which were elicited with 120 single 600-msec duration tones (1 kHz), presented binaurally at 70 dB sound pressure level (SPL).
The Experimental Design
Simple Speech Condition 1: /a/
Speak /a/ 75 times and record utterances.
Listen to playback of the 75 recorded /a/ utterances.
Rapid Speech Condition 2: /a-a-a/
Speak /a-a-a/ 75 times and record utterances.
Listen to playback of the 75 recorded /a-a-a/ utterances.
Complex Speech Condition 3: /a - a-a - a/
Speak /a - a-a - a/ 75 times and record utterances.
Listen to playback of the 75 recorded /a - a-a - a/ utterances.
A key feature of the experiment design related to analysis of the results is that the experiment is fundamentally a comparison between the speaking and listening conditions. Thus, although there are likely to be measurable differences in the audio recorded for productions of the three different speech targets used in the experiment (e.g., f0, formants), for each target, the audio heard by the subject is the same in both the speaking and listening conditions. Any response characteristics specific to the audio features of a given target are therefore removed when we compare responses to this target between the speaking and listening conditions.
A structural magnetic resonance image (MRI) was obtained for each participant at the Magnetic Resonance Science Center of UCSF. The whole head was imaged on a 1.5T General Electric scanner with approximately 124 slices, 1.5 mm thick.
Auditory Evoked Field (AEF) data (average response to 120 single pure tones) were band-pass filtered at 2–40 Hz, the third gradient of the magnetic field was calculated, and the DC offset was removed. The average AEF was analyzed using equivalent current dipole (ECD) techniques. Single dipole localizations were obtained for each hemisphere, and the 1 kHz pure tones elicited cortical activity in the auditory cortex in both hemispheres. Average MNI coordinates for left hemisphere (x, y, z) = -62.5, -20.6, 9.5 and for right hemisphere (x, y, z) = 61.8, -11.5, 8.18, revealed activation in primary auditory cortex (Brodmann areas 41, 42) and superior temporal gyrus in the normalized brain across subjects.
To assess activity changes in auditory cortex, two methods were used: standard Root Mean Square (RMS) averaging of detector measurements, as well as adaptive spatial filtering. Adaptive spatial filtering, or beamforming, is a technique that estimates the source signal specifically in the auditory cortex by attenuating uncorrelated activity in other brain regions, thereby increasing the signal-to-noise ratio [20–22]. The Synthetic Aperture Magnetometry (SAM) parameters were as follows: bandwidth 0–300 Hz, Z-threshold for weights = 5.0, and a time window from -200 to 300 ms. This results in a "virtual channel estimate" of the activation specifically localized in the auditory cortex during speech vocalizations. A virtual channel was created for each condition (/a/, /a-a-a/, /a - a-a - a/) and each task (speak or listen) per hemisphere (left or right) (Figure 4).
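The sensor RMS measure mentioned above can be sketched in a few lines: at each time sample, the trial-averaged field is squared, averaged across detectors, and square-rooted. This is a generic illustration of RMS across sensors, assuming the averaged data are held in an array of shape (sensors, time samples). The function name and the toy values are hypothetical.

```python
import numpy as np

def sensor_rms(evoked):
    """Root mean square across sensors at each time sample.

    evoked: array of shape (n_sensors, n_times) holding the trial-averaged field.
    Returns an array of shape (n_times,).
    """
    evoked = np.asarray(evoked, dtype=float)
    return np.sqrt(np.mean(evoked ** 2, axis=0))

# Toy data: 3 hypothetical sensors, 4 time samples (arbitrary units)
toy_field = np.array([
    [3.0, 0.0, 1.0, -2.0],
    [4.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 1.0, -2.0],
])
rms_trace = sensor_rms(toy_field)
print(rms_trace)
```

Because the values are squared before averaging, sensors on opposite sides of the field reversal do not cancel each other, which is why RMS is a convenient summary of overall response strength.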
Statistical analysis was based on the M100 response, which was defined as the amplitude of the largest peak occurring within a designated time window, 60 to 120 ms post stimulus. For the virtual channel estimate data, a three-way repeated measures Analysis of Variance (ANOVA) was conducted, and a separate ANOVA was conducted using the RMS data for comparison. A simple one-way, within subjects ANOVA was used to analyze the AEF responses to pure tones (amplitude and latency) in both hemispheres. One participant's data were excluded from RMS and virtual channel analyses due to severe contamination from dental artifacts during the speaking task; however, AEF analysis included all ten participants.
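The M100 definition above (largest peak between 60 and 120 ms) can be illustrated with a short sketch. It takes the sample with the largest absolute value inside the window, which is an assumption: the text does not say whether signed or rectified amplitudes were used, and no local-peak test is applied here. The function name and the synthetic waveform are hypothetical.

```python
import numpy as np

def m100_peak(times_ms, waveform, window=(60.0, 120.0)):
    """Return (amplitude, latency_ms) of the largest deflection in the window."""
    times_ms = np.asarray(times_ms, dtype=float)
    waveform = np.asarray(waveform, dtype=float)
    in_window = (times_ms >= window[0]) & (times_ms <= window[1])
    if not in_window.any():
        raise ValueError("no samples fall inside the analysis window")
    idx = np.flatnonzero(in_window)
    # Pick the sample with the largest absolute amplitude inside the window
    best = idx[np.argmax(np.abs(waveform[idx]))]
    return waveform[best], times_ms[best]

# Synthetic waveform: a 50-unit bump at 100 ms and a larger bump at 200 ms
# that lies outside the analysis window and must therefore be ignored
times = np.arange(-200.0, 301.0, 1.0)
wave = (50.0 * np.exp(-0.5 * ((times - 100.0) / 15.0) ** 2)
        + 80.0 * np.exp(-0.5 * ((times - 200.0) / 15.0) ** 2))
amp, latency = m100_peak(times, wave)
print(f"M100 amplitude {amp:.2f} at {latency:.0f} ms")
```

Restricting the search to the window keeps later, larger components from being mistaken for the M100.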
For static articulations, after speech onset, the effect of temporal misalignment errors between actual auditory feedback and the prediction is minimal because the articulators are moving slowly and the feedback prediction is not changing quickly over time. Therefore, in the case of simple articulations, such as a single vowel /a/ in condition 1, temporal inaccuracies should have little effect on prediction error, and SIS should therefore be large. In contrast, for dynamic articulations, such as /a-a/, immediately after onset, the articulators are already in motion to realize the next articulatory goal (in this case, the glottal stop between the first and second productions of /a/). Any temporal misalignment between auditory feedback and the prediction will contribute to a larger prediction error since the feedback prediction is changing rapidly over time. Therefore, in the case of rapid, dynamic articulations in conditions 2 & 3, temporal inaccuracies should increase prediction errors, and thus decrease SIS.
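The argument above can be illustrated with a toy simulation. It is not the authors' model: it simply treats the predicted feedback as an amplitude envelope, treats the actual feedback as the same envelope delayed by a few milliseconds, and measures the mean absolute mismatch after onset. The envelope shapes, the 10 ms delay, the 5 Hz syllable rate and all names are assumptions chosen for illustration.

```python
import numpy as np

FS = 1000  # samples per second (hypothetical)
t = np.arange(0.0, 0.6, 1.0 / FS)  # 600 ms of simulated time, in seconds

def static_envelope(time_s):
    # Sustained vowel: amplitude ramps up over 20 ms, then stays flat
    return np.clip(time_s / 0.02, 0.0, 1.0)

def dynamic_envelope(time_s, rate_hz=5.0):
    # Repeated syllables: same onset, followed by periodic amplitude dips
    return static_envelope(time_s) * (0.5 + 0.5 * np.cos(2 * np.pi * rate_hz * time_s))

def prediction_error(envelope, delay_s, start_s=0.05):
    """Mean absolute mismatch between prediction and delayed feedback after onset."""
    predicted = envelope(t)
    actual = envelope(t - delay_s)  # feedback arrives delay_s later than predicted
    after_onset = t >= start_s
    return np.mean(np.abs(actual - predicted)[after_onset])

static_err = prediction_error(static_envelope, delay_s=0.010)
dynamic_err = prediction_error(dynamic_envelope, delay_s=0.010)
print(f"static target error:  {static_err:.4f}")
print(f"dynamic target error: {dynamic_err:.4f}")
```

With the same timing error, the flat envelope produces no mismatch once the onset ramp has passed, whereas the modulated envelope produces a persistent mismatch. Under the model described in the text, a larger mismatch corresponds to less suppression.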
Analysis of the acoustic output amplitude was conducted in order to verify that the volume participants heard through the earphones in both speak and listen tasks was equivalent. Peak amplitudes of the first syllable in all three conditions were analyzed using a one-way within subjects ANOVA.
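For readers unfamiliar with the one-way within-subjects ANOVA mentioned above, the sketch below computes its F statistic by partitioning the total sum of squares into condition, subject and residual parts. It is a textbook computation without a sphericity correction or p-value, and it is not a reproduction of the authors' analysis. The function name and the toy amplitudes are hypothetical.

```python
import numpy as np

def rm_anova_oneway(data):
    """One-way repeated measures ANOVA.

    data: array of shape (n_subjects, k_conditions).
    Returns (F, df_condition, df_error).
    """
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    grand = data.mean()

    # Partition the total variability
    ss_total = np.sum((data - grand) ** 2)
    ss_cond = n * np.sum((data.mean(axis=0) - grand) ** 2)
    ss_subj = k * np.sum((data.mean(axis=1) - grand) ** 2)
    ss_error = ss_total - ss_cond - ss_subj  # condition x subject residual

    df_cond = k - 1
    df_error = (k - 1) * (n - 1)
    f_stat = (ss_cond / df_cond) / (ss_error / df_error)
    return f_stat, df_cond, df_error

# Made-up peak amplitudes: 3 hypothetical subjects x 3 conditions
toy_amplitudes = [
    [1.0, 2.0, 4.0],
    [2.0, 3.0, 4.0],
    [3.0, 5.0, 5.0],
]
f_value, df1, df2 = rm_anova_oneway(toy_amplitudes)
print(f"F({df1}, {df2}) = {f_value:.2f}")
```

Removing the between-subject sum of squares from the error term is what distinguishes this from a between-groups ANOVA: each participant serves as their own control.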
The authors would like to thank Susanne Honma and Anne Findlay for their technical assistance. This work was supported by a grant from the National Institute on Deafness and other Communication Disorders (RO1 DC006435).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.