Top-down and bottom-up modulation in processing bimodal face/voice stimuli
© Latinus et al; licensee BioMed Central Ltd. 2010
Received: 27 July 2009
Accepted: 11 March 2010
Published: 11 March 2010
Processing of multimodal information is a critical capacity of the human brain, with classic studies showing bimodal stimulation either facilitating or interfering with perceptual processing. Comparing activity to congruent and incongruent bimodal stimuli can reveal sensory dominance in particular cognitive tasks.
We investigated audiovisual interactions driven by stimulus properties (bottom-up influences) or by task (top-down influences) on congruent and incongruent simultaneously presented faces and voices while ERPs were recorded. Subjects performed gender categorisation, directing attention either to faces or to voices and also judged whether the face/voice stimuli were congruent in terms of gender. Behaviourally, the unattended modality affected processing in the attended modality: the disruption was greater for attended voices. ERPs revealed top-down modulations of early brain processing (30-100 ms) over unisensory cortices. No effects were found on N170 or VPP, but from 180-230 ms larger right frontal activity was seen for incongruent than congruent stimuli.
Our data demonstrate that in a gender categorisation task the processing of faces dominates over the processing of voices. Brain activity showed different modulation by top-down and bottom-up information. Top-down influences modulated early brain activity whereas bottom-up interactions occurred relatively late.
The ability to integrate information from several sensory modalities is a vital skill of the human brain, as information we receive from the external world is often multimodal. Although there has been a recent surge of research focusing on the processing of multimodal information, our knowledge of the neural substrates underlying this ability for complex stimuli in humans is still limited.
Researchers have used two main paradigms to investigate multimodal processing. One is designed to assess the perceptual gain of multisensory inputs by comparing the behaviour and the neural activity evoked by multimodal and unimodal inputs [1, 2]. The other paradigm assesses the competition between senses using bimodal stimuli that are either congruent or incongruent; using incongruent stimuli can reveal the existence of a cross-modal bias. These two approaches yield different information: the first determines the advantages and limits of multimodality, while the second provides information on sensory dominance and its influence on task performance. The present study investigates sensory competition or dominance in the processing of gender in bimodal face/voice stimuli.
Sensory dominance has been largely studied in terms of spatial localisation or temporal discrimination. The research approach of comparing congruent and incongruent bimodal stimuli has demonstrated that the influence of the senses is asymmetric and task-dependent. For example, in ventriloquism, the visual-spatial information biases the localisation of the source of auditory information toward the source of visual information [4–6]. The localisation of a visual stimulus is, however, almost unaffected by simultaneous discordant auditory information. In contrast, in the temporal domain, the auditory modality dominates the visual, i.e. when subjects judge temporal aspects of a stimulus (frequency of occurrence, temporal frequency, etc.), auditory stimuli modulate perceived information in the visual modality [7–9]. These results suggest that in the spatial domain, vision dominates audition, while in the temporal domain, the reverse is true. Using emotional faces and voices, it has been demonstrated that a static face alters the perception of vocal emotion even when the task required ignoring the face [3, 11]. One aim of the present study was to determine whether sensory dominance could be observed in the processing of faces and voices, i.e. is the influence of one sensory modality on the other equivalent or symmetrical in the perception of gender? To this end, we manipulated attention through task demands on congruent and incongruent face/voice stimuli.
Neural correlates of multimodal processing have been investigated using fMRI, PET and ERPs, with results showing that bimodal processing was task-sensitive . As in the behavioural literature, various approaches have been used to study neural mechanisms underlying multimodal processing. Comparing the brain activity for bimodal stimuli to the sum of activity for unimodal stimuli (e.g., AV - (A+V)) revealed that congruent bimodal stimuli enhanced brain activity either in sensory-specific cortices [1, 13, 14] or in brain regions described as heteromodal . The timing of this bimodal activation was very rapid, affecting brain processing within 40 ms [1, 16–18]. Even with more biological stimuli (sounds and pictures of animals), early interactions between visual and auditory processing were seen on the visual N1 component (~150 ms) .
Investigations of higher-level multimodal processing critical to human social interactions (faces and voices) have been less common, with most studies on face and voice integration focussed on speech processing. The interaction between visual and auditory stimuli in the speech domain is classically demonstrated by the McGurk effect . As seen with simple bimodal object and spatial processing, there is a behavioural advantage of bimodal redundant speech . Audiovisual integration of faces and voices has also been shown in non-human primates, as monkeys are able to match a face and a certain vocalisation , demonstrating its wider application to other social species. The small literature on face/voice interactions in a non-verbal context is largely focussed on emotional processing [23–25]. Emotion expression protocols have also been used with monkeys, as Parr (2004) showed, in a match-to-sample task, a modality preference depending on the expression to be matched . Bidirectional interference in processing has been demonstrated with incongruent emotional voices and faces  suggesting no sensory dominance in the processing of emotions. Congruent emotional faces and voices enhance the auditory N1 [11, 25]; yet, in a bimodal speech perception study, the opposite was demonstrated, a reduced N1 to congruent bimodal stimuli .
Although face/voice associations to extract non-speech information have been rarely studied, there is a wealth of face and voice processing studies in unimodal paradigms. A large literature provides evidence that faces are processed through a distributed and hierarchical network ; neurophysiological studies provide latencies for the different stages of face processing. The N170 component is sensitive to a range of manipulations of faces [29–32] suggesting that it reflects automatic face processing [33, 34]. Earlier components have also been reported to be face-sensitive [35, 36].
Comparable studies have been completed with voices, often referred to as 'auditory faces' due to the similarity of information carried by faces and voices [37, 38], and have revealed that the processing of non-speech information of voices involved structures located along the right superior temporal sulcus. There are few ERP studies comparing voices to other auditory stimuli. Two papers report a positive deflection 320 ms after stimulus onset that is larger to voices than to musical instrument stimuli, labelled the Voice Selective Response (VSR) [39, 40]. A recent study comparing voices to various non-vocal sounds suggests that the voice/non-voice discrimination could occur earlier, in the latency range of the auditory P2, 160-240 ms . The processing of faces and voices seems thus to draw on specialised and distinct brain regions and to have distinct temporal profiles.
We hypothesized that if vision dominates over audition in gender perception, an incongruent face would disrupt the processing of voice gender while an incongruent voice would have less impact on the perception of face gender. On the other hand, if incongruence has a similar effect regardless of whether subjects performed the task on faces or voices, this would suggest an equivalent influence of the two senses on each other. We also hypothesized that directing attention to one or the other modality would modulate brain activity earlier than stimulus congruency. We showed that directing attention to only one modality modulated early ERPs that were more representative of the attended modality. The congruency task required the processing of both auditory and visual information and the pattern of cerebral activity reflected interaction effects. Comparing congruent and incongruent stimuli allowed us to show that faces dominate over voices in the integration of auditory and visual information of gender, and also demonstrated that bottom-up or automatic processing of the bimodal stimuli arose later (~180 ms) in right frontal regions.
Hits and Reaction Times for each attentional task and congruence.
Reaction times (RTs) were influenced by task (F2,36 = 63.09, p < 0.001), being longer in the BOTH task, as the congruency judgment took longer than gender categorisation (paired comparisons, p < .05 - Table 1). Gender categorisation took longer for voices than faces (difference = 151.15 ms, 95% CI = [112.71 191.11], p < .0001 - Figure 2b, Table 1). Finally, incongruent stimuli took longer to categorise for all three tasks regardless of attentional conditions (F1,18 = 35.89, p < 0.001 - difference = 44.89 ms, 95% CI = [30.46 59.28]); thus, the bimodal information was processed regardless of whether it was required for the task performance, suggesting an automaticity in face and voice processing.
Early effects, P1 and N1 components
P1 amplitude as a function of electrode in the different attentional tasks.
To Voices (μV)    To Faces (μV)    To Both (μV)
4.827 ± .643      6.018 ± .754     5.431 ± .808
5.223 ± .614      5.667 ± .697     4.992 ± .720
5.349 ± .642      6.999 ± .559     6.251 ± .723
Later effects; visual and auditory P2s, VSR
Neither attention nor congruency affected the visual P2 or the VSR significantly. Both components showed hemisphere effects, however. The visual P2 was larger in the right than in the left hemisphere (F1,18 = 8.54, p = 0.009); the VSR had a shorter latency (F1,18 = 10.4, p = 0.005) and larger amplitude (F1,18 = 17.42, p = 0.001) over the right hemisphere. The auditory P2 has been proposed to index voice processing, yet it was not apparent in our study. We reasoned that the auditory P2 may be masked by the VPP, which occurs in a similar latency range and over the same electrodes.
Spatio-temporal analyses of ERP topography between 170 and 220 ms showed a larger negativity in the BOTH task compared to FACE and VOICE tasks at frontal electrodes (Figure 6c). Post-hoc tests revealed significant differences between VOICE and BOTH on bilateral posterior electrodes as well as differences between VOICE and FACE on bilateral temporal electrodes (Figure 6c). A stimulus-driven congruency effect showed a significantly increased positivity to incongruent stimuli in the P2 latency range between 182 and 230 ms, in right centro-temporal areas associated with an increased negativity in left posterior regions (Figure 6d).
This study investigated the influence of top-down and bottom-up processes on the important human ability of integrating multimodal face/voice stimuli. Top-down influences were manipulated by the task requirements; stimuli were the same in all three tasks, only attentional instructions differed. Bottom-up influences were evident in the processing of congruent versus incongruent stimuli, i.e. how stimulus characteristics influenced the interaction between modalities.
Top-down and bottom-up influences on behaviour
Behavioural data showed that directing attention toward the auditory or visual modality biased the processing of the bimodal face/voice stimuli. With the same bimodal stimuli in the tasks, we showed that RTs were shorter when attention was directed to faces than voices (regardless of congruency). This is in accordance with other reports studying bimodal natural object recognition [19, 27, 43] showing that visually based categorisation is faster than auditory based categorisation. RTs were longer for incongruent stimuli regardless of the direction of attention; thus, the unattended modality affected processing in the attended modality, revealing the automatic processing of bimodal information . Incongruent information modulated subjects' accuracy according to the task. Accuracy was lower in the VOICE task when the voice was presented with an incongruent face, an effect not seen in the FACE task.
This result suggests asymmetrical interference between the processing of faces and voices in gender recognition: faces impact the processing of voice gender more than the reverse. A recent study using ambiguous faces showed that low-level auditory features influence the perception of face gender. Although this result could be seen as opposite to ours, this is not the case, as the gender of the faces in the Smith et al. study was ambiguous and thus gender attribution was mostly based on auditory cues. Asymmetrical interference effects have been reported in studies using various paradigms and stimuli, and have been understood as reflecting a sensory dominance in the processing of particular features [8, 18, 43]. Our results demonstrated that in gender categorisation of faces and voices, visual information dominates auditory information. This dominance of faces over voices for gender discrimination could be explained by different hypotheses of sensory dominance. One is the information reliability hypothesis, which suggests that the dominant modality is whichever is more appropriate and more efficient for the realisation of the task. In our study the more reliable modality would be vision due to intrinsic properties of the stimuli; the information required to perform gender categorisation is easily and immediately extracted from a face, whereas auditory stimuli are always dynamic and thus some number of cycles needs to be heard before the gender of a voice can be recognised. Another possible hypothesis for the visual dominance would be that sensory dominance results from top-down influences. However, if a stimulus automatically captures attention in one modality (such as faces in the present case), the processing of that stimulus would occur despite attention instructions, and any dominance due to attention would be reduced.
This latter explanation is in accordance with studies demonstrating that gender categorisation of faces occurs in the near absence of attention; that is, gender is automatically extracted from faces. Thus, the automatic processing of faces would reduce or mask the processing in the auditory modality even when attention was explicitly directed to the voices.
The hardest of the three tasks was to determine if the gender of both face and voice was congruent, reflected by this task's lower accuracy and longer RTs. In other multimodal studies, a behavioural facilitation is often reported with bimodal stimuli [1, 18, 48]. However, in tasks involving identification of a non-redundant target, accuracy is reduced  and RTs are generally longer . In the BOTH task, subjects were not identifying a single target but making a congruency judgement which required the extraction of relevant information from both modalities; it is consistent with the literature that this task was the most difficult.
Behavioural results provide evidence of a modulation of the responses by both top-down and bottom-up influences. Bottom-up incongruent information delayed the processing of gender in the attended modality regardless of attention instructions. Top-down processes also impacted gender categorisation of bimodal face/voice stimuli. We suggest that directing attention to a specific sensory modality led to a competition in attentional resources, particularly evident in the VOICE condition. As face processing appears mandatory , some attentional resources are automatically allocated to faces, which may account for voice processing being less efficient than face processing with the bimodal stimuli. Directing attention to both auditory and visual modalities (BOTH task) led to longer RTs and lower accuracy, again likely reflecting dispersed attentional resources.
Top-down and bottom-up influences on ERPs
The ERP waveforms, regardless of the task, were very similar to those described in the face literature [29, 32]. This supports the suggestion that in our paradigm face processing dominated over voice processing, in accordance with the conclusions from the behavioural data.
Modulation of brain activity by top-down processes
Neurophysiological responses were modulated by task as early as 30 ms, as seen in the dissimilar topographies as a function of the direction of attention. Various studies have reported very early activity reflecting bimodal integration when comparing the response to bimodal stimuli to the sum of responses to unimodal stimuli [1, 17, 18, 51]. Early multimodal effects were explained either as anticipatory effects or as recruitment of a novel population of neurons by bimodal stimuli in the visual cortex. In the present study, this early modulation reflected top-down processes, as we found early activation of unisensory cortices of the attended modality attributable to preparatory processes. This is in accordance with fMRI and ERP data showing attention-related modulations in modality-specific cortices for bimodal stimuli [49, 52, 53]. In the VOICE task, the observed brain topography to the bimodal stimuli showed a larger activity in fronto-central brain regions, whereas in the FACE condition, activity to the bimodal stimuli was larger in right occipital regions. Thus, directed attention to either vision or audition led to greater activation in the respective modality-specific cortices. This interpretation is indirect: it rests on comparing our results with those in the literature, as we did not use unimodal stimuli. Topography in the BOTH task differed slightly from the average topography of the FACE and VOICE conditions, particularly over fronto-central regions, which might reflect greater attention to voices in the congruency judgment task, as processing voices is less automatic than processing faces. This is in accordance with the conclusion of the behavioural discussion: directing attention to both faces and voices led to a spread of attention, seen neurophysiologically as the intermediate topography observed for the BOTH task.
The early effects in the present study demonstrated that subjects are able to direct their attention to a specific modality; brain activity for the different tasks being representative of the unimodal activity. This is an important finding and justifies the use of paradigms involving directed attention to one sensory modality.
The early visual P1 was larger when attention was directed to faces, seen in FACE and BOTH tasks, consistent with ERP studies showing a larger amplitude for attended versus non-attended stimuli ; yet the early auditory N1 amplitude did not show modulation by attention. P1 topography differed across the conditions: P1 in FACE and BOTH was maximal over occipital electrodes whereas P1 in the VOICE task was more parietal. These topographical differences suggested overlapping components affecting the P1 in the VOICE compared to the other two tasks. Furthermore, the three tasks impacted P1 and N1 differently, suggesting a modulation of the N1/P1 complex in central regions by the processing of auditory information. The fronto-central N1 recorded in the present study may be the negative counterpart of the P1, generally observed with visual stimuli , or may reflect auditory processing . Unimodal studies of auditory processing find that auditory N1 is enhanced to attended auditory stimuli . The absence of differences on the N1 across the conditions may be due to either a deactivation of auditory cortex when attention was directed to faces or a greater activation of auditory cortex when attention was directed to voices; effects which would cancel each other out, leaving no apparent changes to the bimodal stimuli.
N170 and VPP peaked earlier when attention was directed to both faces and voices (BOTH task), but no amplitude effects were seen. N170 reflects an automatic processing of faces as demonstrated by various studies, and its amplitude is not modulated by attention [34, 50]; thus, we did not expect a difference in N170 across tasks. In contrast, task affected brain activity around 100 ms, in accordance with studies showing that attention modulates the processing of audiovisual stimuli at different latencies.
The auditory P2 was not seen in our data; it probably was obscured by the presence of the VPP. However, we observed a shoulder in the descending slope of the VPP around the auditory P2 latency (180/190 ms ) that may correspond to processes normally underlying P2 in unimodal conditions, such as voice processing . Visual inspection of the grand average ERPs revealed that the shoulder was larger in VOICE and BOTH conditions than in the FACE condition; a larger shoulder would imply increased voice processing. In accordance with this suggestion, in the FACE task, the shoulder appeared to be more evident for incongruent stimuli, implying that voices were still processed when they carried incongruent information irrelevant for the task, consistent with the longer RTs in the FACE condition for incongruent stimuli.
The processing of paralinguistic information of faces and voices is shown to be dependent on the sensory modality to which the attention is directed. Moreover, our data showed that the interaction between the processing of faces and voices is asymmetrical with greater influences of visual information than of auditory information. The modulation of bimodal integration by top-down influences could reflect a general mechanism underlying multimodal integration; it is the first time that multimodal ERPs are shown to be task-dependent in the processing of faces and voices at a relatively low-level of processing.
Modulation of brain activity by bottom-up processes
Congruency affected brain activity between 180 and 230 ms after stimulus onset: incongruent stimuli evoked a more positive activity than congruent stimuli in right anterior frontal regions. fMRI studies using bimodal stimuli have shown that the processing of incongruent and congruent stimuli differed in activation in the inferior frontal gyrus (IFG) and the anterior insula [13, 59–61], areas thought to be heteromodal. Activity in these regions decreased for incongruent stimuli [15, 62]. The localisation of the modulation of brain activity by congruency in the present study is compatible with the suggestion that differences between congruent and incongruent stimuli arise from insula or right IFG, and provides a latency (190 ms) for the effect previously described in the fMRI literature. This result is also in accordance with other ERP studies that reported differences due to congruency over frontal regions before 200 ms. The inferior frontal gyrus and insula in the left hemisphere are thought to reflect the retrieval and manipulation of linguistic semantic representations [64, 65]. Other studies demonstrated the role of right insula and IFG in the detection of asynchrony between auditory and visual stimuli. Our data suggest that those regions could also be involved in more general mismatch judgments, such as congruency judgments in terms of gender.
One limitation of the study is the use of natural stimuli that can introduce physical differences between the conditions (e.g. between male and female faces or voices). We were interested in the perception of gender on bimodal face/voice stimuli under normal, ecological conditions; this study allows us to show that using these more natural, less tightly controlled stimuli a bias was observed toward faces in the perception of gender. This result suggests that in everyday life situations the perception of gender from faces will dominate over voices. Further study should investigate the perception of gender on more controlled stimuli: for example by using normalised faces and voices, or by controlling the timbre of individual voices, in order to make the tasks equally difficult across sensory modalities. We believe that this could be assessed by using faces in which all "cultural" cues of gender have been removed and by using vowels instead of words.
Another limitation is the fact that we used only bimodal stimuli. Because we were interested in sensory dominance we did not include unimodal conditions to directly compare responses to bimodal stimuli to responses to unimodal stimuli. It should be noted, first, that the lack of unimodal conditions does not prevent drawing conclusions on the sensory dominance in the perception of voice gender, and second, that the rich literature on both face [29, 32, 67] and voice perception [41, 68, 69] allows for at least an indirect comparison with existing studies. Further studies, however, should certainly include unimodal conditions to assess the gain of multimodal information in the perception of voice gender.
We describe dominance of vision over audition in the perception of voice gender behaviourally and neurophysiologically. We observed that top-down influences modulated the processing of bimodal stimuli as early as 40 ms after stimulus onset, yet this influence depended on the preferential modality for the task, providing evidence for a visual bias in the case of face/voice gender categorisation. This bias may be reversed when studying speech perception - a hypothesis to be validated by further studies. Congruency in face and voice stimuli affected neural responses around 190 ms, suggesting that bottom-up multimodal interactions for gender processing are relatively late.
Nineteen English-speaking adults (9 women, range = 20-35 years, mean = 26.4 years) participated in the study. Subjects reported normal medical history and no hearing problems; all had normal or corrected-to-normal vision. They all provided informed written consent; the experiment was approved by the Sunnybrook Health Sciences Research Ethics Board.
Stimuli and procedure
Stimuli were bimodal auditory/visual stimulus pairs that were front view greyscale pictures of faces, which subtended a visual angle of 8° × 6° (see Figure 1, the face stimuli are published with the consent of the models), associated with a voiced word. Previous studies have reported significant findings with the combination of static faces and voices [27, 70]. Face stimuli were photographs of 3 men and 3 women taken while speaking 14 different words, thus a total of 42 female and 42 male faces. Voice stimuli were 14 monosyllabic French words recorded in stereo from 3 female and 3 male speakers; thus, there were also 42 female and 42 male voice stimuli. The words averaged 300 ms in duration, including 10 ms rise and fall times. French words were used with our English-speaking subjects to limit the extent of semantic processing. The voices and faces were randomly associated to form 84 stimuli: 42 were congruent, being female face/female voice and male face/male voice, and 42 were incongruent (i.e., male face/female voice or female face/male voice). Face stimuli were presented for 300 ms in the centre of a computer screen. Auditory stimuli were normalised for intensity using Matlab; they were presented binaurally through earphones (Etymotic Research, Inc.) at normal speaking levels (68 dB ± 5 dB). Face stimuli onset was synchronised with the onset of auditory stimuli using Presentation software; interstimulus intervals varied randomly between 1300 and 1600 ms.
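The pairing of faces and voices described above can be illustrated with a minimal sketch. It assumes that every face and every voice is used exactly once, which, together with the 42/42 split reported in the text, implies 21 pairs in each of the four face/voice gender combinations. The function name `build_stimuli` and the tuple layout are hypothetical and are not taken from the original experiment scripts.

```python
import random

def build_stimuli(seed=0):
    """Pair 84 faces with 84 voices: 42 congruent and 42 incongruent pairs.

    Assumption (not stated in the text): each face and each voice is used once.
    """
    rng = random.Random(seed)
    items = {}
    for kind in ("face", "voice"):
        for gender in ("F", "M"):
            # 3 speakers x 14 words = 42 items per gender and modality
            pool = [(gender, speaker, word)
                    for speaker in range(3) for word in range(14)]
            rng.shuffle(pool)
            items[kind, gender] = pool

    half = 21
    pairs = []
    for gender, other in (("F", "M"), ("M", "F")):
        faces = items["face", gender]
        same = items["voice", gender][:half]   # voices for congruent pairs
        diff = items["voice", other][half:]    # voices for incongruent pairs
        pairs += [(f, v, True) for f, v in zip(faces[:half], same)]
        pairs += [(f, v, False) for f, v in zip(faces[half:], diff)]
    rng.shuffle(pairs)                         # randomise presentation order
    return pairs

stimuli = build_stimuli()
print(len(stimuli), sum(congruent for _, _, congruent in stimuli))
```

Each returned triple holds a face, a voice and a congruency flag; the flag is true exactly when the face and voice genders match.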
The subjects performed three different gender judgment tasks: 1) The first task was to indicate with one of two keys (right and left ctrl key) whether the stimuli were congruent or incongruent in terms of gender, i.e. the subjects had to pay attention to both face and voice gender (BOTH). Subjects completed two blocks of 84 stimuli; response key attribution was counterbalanced across subjects. As this task differed in terms of response mapping it was always run first. 2) Attention was directed towards the faces: subjects performed a gender discrimination of faces (FACE) while ignoring the voices for 84 trials. 3) In the third task they performed gender discrimination of the voices (VOICE) while ignoring the faces for 84 trials. In the latter two tasks, participants pressed one keyboard key (right and left ctrl key) for female and another for male. The order of the presentation of these two tasks was counterbalanced across subjects, as was the response key attribution.
EEG recording and analysis
The ERPs were recorded in a dimly lit sound-attenuating booth; participants sat 60 cm from a screen on which stimuli were presented. A fixation cross appeared between presentations and subjects were asked to look at it and refrain from making eye movements. EEG was recorded using an ANT (Advanced Neuro-Technology, Enschede, Netherlands) system and a 64 electrode cap, including three ocular electrodes to monitor vertical and horizontal eye movements. Impedances were kept below 5 kΩ. The sampling acquisition rate was 1024 Hz. FCz was the reference during acquisition; an average reference was calculated off-line.
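As a worked illustration, the physical size of the face images on the screen can be estimated from the reported visual angle (8° × 6°) and viewing distance (60 cm). The sketch below uses the standard relation size = 2 · d · tan(θ/2) and assumes the image is centred on the line of sight; the function name is hypothetical.

```python
import math

def size_from_visual_angle(angle_deg, distance_cm):
    """Physical extent (cm) that subtends angle_deg at distance_cm."""
    return 2 * distance_cm * math.tan(math.radians(angle_deg) / 2)

for angle in (8, 6):
    print(f"{angle} deg at 60 cm -> {size_from_visual_angle(angle, 60):.2f} cm")
```

Under these assumptions the faces would have measured roughly 8.4 cm by 6.3 cm on the screen.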
Continuous EEG was epoched into 600 ms sweeps including a 100 ms pre-stimulus baseline. Ocular and muscular artefacts, or trials containing an amplitude shift greater than 100 μV, were rejected from analyses. Epochs were averaged by condition (6 conditions: congruent/incongruent in the 3 tasks) and filtered using a bandpass filter of 1-30 Hz.
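The epoching and rejection steps can be sketched for a single channel as follows. This is an illustration under stated assumptions, not the authors' pipeline: baseline correction is assumed to subtract the mean of the 100 ms pre-stimulus interval, the "amplitude shift" criterion is read as a peak-to-peak range above 100 μV, and the 1-30 Hz band-pass filter is omitted. All names are hypothetical.

```python
FS = 1024                     # sampling rate in Hz (from the recording section)
PRE_MS, TOTAL_MS = 100, 600   # baseline and total epoch length in ms
REJECT_UV = 100.0             # rejection threshold in microvolts

def extract_epoch(signal, onset, fs=FS):
    """Return a baseline-corrected epoch, or None if incomplete or rejected.

    signal: list of samples in microvolts; onset: sample index of the stimulus.
    """
    n_pre = round(PRE_MS * fs / 1000)
    n_total = round(TOTAL_MS * fs / 1000)
    start = onset - n_pre
    if start < 0 or start + n_total > len(signal):
        return None
    epoch = signal[start:start + n_total]
    baseline = sum(epoch[:n_pre]) / n_pre          # mean of pre-stimulus samples
    epoch = [v - baseline for v in epoch]
    # Assumed reading of "amplitude shift": peak-to-peak range within the epoch.
    if max(epoch) - min(epoch) > REJECT_UV:
        return None
    return epoch

def average_epochs(epochs):
    """Average the retained epochs sample by sample (one condition)."""
    kept = [e for e in epochs if e is not None]
    return [sum(column) / len(kept) for column in zip(*kept)]
```

At 1024 Hz a 600 ms sweep corresponds to 614 samples, of which the first 102 form the baseline. Averaging would be run once for each of the six conditions.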
Peak analyses were completed on the classical peaks described in the visual ERP literature, i.e. P1, N170, P2 and VPP (Vertex Positive Potential), and in the auditory ERP literature, i.e. N1 and VSR. Unimodal auditory stimuli generally evoke biphasic ERPs: the negative N1, mentioned above, followed by the auditory P2 in fronto-central regions, a positive wave occurring between 160 and 240 ms after stimulus onset. An auditory P2 was not seen in our data, probably because its temporal coincidence with the VPP masked it. Peak latencies and amplitudes were measured for each participant in a ± 30 ms time-window centred on the latencies of the peak in the grand average (visual - P1: 105 ms, N170: 155 ms, VPP: 160 ms and P2: 220 ms; auditory - N1: 100 ms and VSR: 350 ms, see Figure 3). P1 and P2 were measured at O1/O2, PO7/PO8 and PO3/PO4. N170 was measured at PO9/PO10, PO7/PO8, P7/P8 and P9/P10. VPP was measured at FC1/FC2, FC3/FC4, F1/F2, F3/F4 and C1/C2. Auditory N1 was measured at FC1/FC2, C1/C2 and CP1/CP2, and VSR at AF3/AF4, F3/F4 and F1/F2 (see Figure 4a). Latencies were measured at one time point per hemisphere at the electrode with the largest amplitude. Amplitudes were taken at this latency at the other selected electrodes over the hemisphere.
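The peak measurement rule can be sketched as follows for one hemisphere. The sketch assumes each ERP is a list of samples at 1024 Hz starting 100 ms before stimulus onset, and that "largest amplitude" means largest in the direction of the component's polarity (+1 for P1, VPP, P2 and VSR; -1 for N170 and N1). Function names are hypothetical.

```python
def peak_index(erp, center_ms, polarity, fs=1024, pre_ms=100, half_window_ms=30):
    """Index of the extreme sample within center_ms +/- half_window_ms."""
    def to_index(ms):
        return round((ms + pre_ms) * fs / 1000)
    lo = max(to_index(center_ms - half_window_ms), 0)
    hi = min(to_index(center_ms + half_window_ms), len(erp) - 1)
    return max(range(lo, hi + 1), key=lambda i: polarity * erp[i])

def measure_hemisphere(erps, center_ms, polarity, fs=1024, pre_ms=100):
    """erps maps electrode name -> ERP samples (microvolts) for one hemisphere.

    The latency is taken at the electrode with the largest peak; amplitudes at
    all selected electrodes are then read at that same latency.
    """
    peaks = {name: peak_index(erp, center_ms, polarity, fs, pre_ms)
             for name, erp in erps.items()}
    reference = max(peaks, key=lambda name: polarity * erps[name][peaks[name]])
    idx = peaks[reference]
    latency_ms = idx * 1000 / fs - pre_ms
    amplitudes = {name: erp[idx] for name, erp in erps.items()}
    return latency_ms, amplitudes
```

For example, `measure_hemisphere({"O2": o2, "PO8": po8, "PO4": po4}, center_ms=105, polarity=+1)` would return the right-hemisphere P1 latency and the three amplitudes at that latency, where `o2`, `po8` and `po4` stand for one participant's averaged waveforms.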
Peak analyses have been extensively used in the ERP literature; however, this technique restricts the analysis to time intervals where a peak is seen. In contrast, spatio-temporal analyses determine when brain activity differs significantly between two conditions and allow ERP differences to be identified independently of peak measures [1, 16]. Studies of multimodal processing have shown early modulation of brain activity around 40 ms [1, 73] that does not correspond to a precise peak. Thus, we also analysed spatio-temporal effects by comparing brain activity at each time point and electrode.
Behavioural data and peak latencies and amplitudes were submitted to repeated measures analyses of variance (using SPSS11); within subject factors were task (3 levels), stimulus (2 levels) and hemisphere (2 levels) for peak latencies plus electrode (different levels depending on the component) for peak amplitudes. After main effects were assessed, we performed paired comparison and post-hoc tests (for interactions) to determine the factors leading to the effects.
Spatio-temporal effects were assessed by comparing brain activity for the different conditions at each time point and electrode. Repeated measures ANOVAs within the general linear model framework were run on the ERPs using Matlab7.2, with task and stimulus as within-subject factors, at each time point and electrode. To estimate the statistical significance of the ANOVA, we calculated a data-driven distribution of F-values using a bootstrap-F method; this method makes no assumption about the normality of the data distribution and is therefore robust to normality violations [74, 75]. Data were centred at 0 so that they conformed to the null hypothesis that conditions do not differ from 0. ANOVAs at each time point and electrode were run on the centred data after resampling the subjects with replacement. We stored the bootstrapped F-values for each time point and electrode independently. This operation was repeated 999 times to obtain a distribution of 1000 bootstrapped estimates of F-values under the null hypothesis. To correct for multiple comparisons, we stored the maximum F-value obtained across all time points in each random sampling loop and for each electrode independently. We then calculated a 95% confidence interval of the maximum F-values for each electrode. The repeated measures ANOVA was considered significant at a given time point and electrode if the F-value fell outside the bootstrapped 95% confidence interval. Degrees of freedom (df) are the same for all statistics presented in this study: 2 and 36 for the task factor, 1 and 18 for the stimulus factor and 2 and 36 for the interaction (df of factor and error, respectively).
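The bootstrap-F procedure with maximum-F correction can be sketched as follows, in Python rather than Matlab. This is a simplified illustration under stated assumptions and not the analysis code of the study: it handles a single within-subject factor at a single electrode, it interprets "centring" as subtracting each condition's across-subject mean at every time point, and all function names, the toy data and the reduced number of resamples are hypothetical.

```python
import math
import random


def rm_anova_f(rows):
    """F-value of a one-way repeated measures ANOVA.

    rows[s][c] is the value for subject s in condition c.
    """
    n, k = len(rows), len(rows[0])
    grand = sum(sum(r) for r in rows) / (n * k)
    cond_means = [sum(r[c] for r in rows) / n for c in range(k)]
    subj_means = [sum(r) / k for r in rows]
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_err = sum(
        (rows[s][c] - cond_means[c] - subj_means[s] + grand) ** 2
        for s in range(n)
        for c in range(k)
    )
    if ss_err == 0.0:
        return 0.0  # degenerate resample; pragmatic guard, not from the source
    return (ss_cond / (k - 1)) / (ss_err / ((k - 1) * (n - 1)))


def f_over_time(data):
    """data[s][c][t] -> list with one F-value per time point."""
    n_time = len(data[0][0])
    return [
        rm_anova_f([[cond[t] for cond in subj] for subj in data])
        for t in range(n_time)
    ]


def bootstrap_max_f(data, n_boot=1000, alpha=0.05, seed=0):
    """Observed F-values, max-F threshold and significance per time point."""
    rng = random.Random(seed)
    n, k, n_time = len(data), len(data[0]), len(data[0][0])
    observed = f_over_time(data)
    # Centre each condition on zero so that the null hypothesis holds
    # (assumed interpretation of the centring step).
    means = [
        [sum(data[s][c][t] for s in range(n)) / n for t in range(n_time)]
        for c in range(k)
    ]
    centred = [
        [[data[s][c][t] - means[c][t] for t in range(n_time)] for c in range(k)]
        for s in range(n)
    ]
    max_f = []
    for _ in range(n_boot):
        # Resample subjects with replacement, keeping each subject's
        # conditions together.
        sample = [centred[rng.randrange(n)] for _ in range(n)]
        # Keep only the largest F across time points (multiple-comparison
        # correction).
        max_f.append(max(f_over_time(sample)))
    max_f.sort()
    threshold = max_f[math.ceil((1 - alpha) * n_boot) - 1]
    return observed, threshold, [f > threshold for f in observed]


# Toy data (hypothetical): 12 subjects, 3 conditions, 4 time points,
# with a large effect in the third condition at time index 1.
toy_rng = random.Random(1)
data = []
for _ in range(12):
    subject = [[toy_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
    subject[2][1] += 10.0
    data.append(subject)

observed, threshold, significant = bootstrap_max_f(data, n_boot=200)
print(threshold, significant)
```

Because the threshold is derived from the maximum F across time points in each resample, a single threshold per electrode controls the error rate over the whole time course. A full analysis would repeat this for every electrode and for both factors and their interaction.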
Post-hoc tests were run for the Task factor whenever the ANOVA was significant. Data-driven confidence intervals were calculated for each comparison (VOICE vs. FACE, VOICE vs. BOTH and FACE vs. BOTH). We performed the analyses across subjects by sampling conditions with replacement (electrodes by time points matrices), independently for each subject. For each random sample, we averaged ERPs across subjects independently for each condition, then computed the difference between the averages for the two conditions (for instance VOICE vs. FACE). In each random sampling loop and for each electrode independently, we stored the maximum absolute difference obtained across all time points. This process was repeated 1000 times, leading to a distribution of bootstrapped estimates of the maximum absolute difference between two ERP conditions, averaged across subjects, under the null hypothesis H0 that the two conditions were sampled from populations with similar means. Then the 95% confidence interval of the mean maximum absolute differences was computed at each electrode (alpha = 0.05). Finally, absolute differences between two sample means at any time point at one electrode were considered significant if they fell outside the H0 95% confidence interval for that electrode.
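The post-hoc comparison between two conditions can be sketched in the same spirit. The Python example below resamples the two condition labels with replacement within each subject, averages across subjects, and keeps the maximum absolute difference across time points for each resample. It is a minimal sketch for one electrode under assumptions: the function names, the toy data and the reduced number of resamples are hypothetical and do not come from the original analysis.

```python
import math
import random


def mean_over_subjects(erps):
    """erps[s][t] -> across-subject average at each time point."""
    n = len(erps)
    return [sum(e[t] for e in erps) / n for t in range(len(erps[0]))]


def abs_difference(erps_a, erps_b):
    mean_a, mean_b = mean_over_subjects(erps_a), mean_over_subjects(erps_b)
    return [abs(a - b) for a, b in zip(mean_a, mean_b)]


def posthoc_max_diff(cond_a, cond_b, n_boot=1000, alpha=0.05, seed=0):
    """cond_a[s][t] and cond_b[s][t] are ERPs at one electrode."""
    rng = random.Random(seed)
    observed = abs_difference(cond_a, cond_b)
    max_diffs = []
    for _ in range(n_boot):
        boot_a, boot_b = [], []
        for erp_a, erp_b in zip(cond_a, cond_b):
            # Sample the condition labels with replacement, independently
            # for each subject: under H0 the two ERPs are exchangeable.
            pair = (erp_a, erp_b)
            boot_a.append(rng.choice(pair))
            boot_b.append(rng.choice(pair))
        # Keep the largest absolute difference across time points.
        max_diffs.append(max(abs_difference(boot_a, boot_b)))
    max_diffs.sort()
    threshold = max_diffs[math.ceil((1 - alpha) * n_boot) - 1]
    return observed, threshold, [d > threshold for d in observed]


# Toy data (hypothetical): 16 subjects, 5 time points, with a large
# difference between the two conditions at time index 2.
toy_rng = random.Random(2)
cond_a = [[toy_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(16)]
cond_b = [[toy_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(16)]
for erp in cond_b:
    erp[2] += 10.0

observed, threshold, significant = posthoc_max_diff(cond_a, cond_b, n_boot=200)
print(threshold, significant)
```

As in the ANOVA step, taking the maximum across time points within each resample yields one threshold per electrode, so an observed difference is flagged only if it exceeds what the largest chance difference would be anywhere in the time course.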
Marianne Latinus was supported by la Fondation pour La Recherche Médicale (FRM, FDT20051206128). We thank Dr. Nancy J. Lobaugh for her generosity in allowing us full access to her ERP lab, and the help provided with the studies by Erin Gibson. We would like to thank Ian Charest and Guillaume Rousselet for their help in implementing the bootstrap analyses of the ERP data.
- Giard MH, Peronnet F: Auditory-visual integration during multimodal object recognition in humans: a behavioral and electrophysiological study. Journal of Cognitive Neuroscience. 1999, 11 (5): 473-490. 10.1162/089892999563544.
- Giard MH, Fort A, Mouchetant-Rostaing Y, Pernier J: Neurophysiological mechanisms of auditory selective attention in humans. Front Biosci. 2000, 5: D84-94. 10.2741/Giard.
- Vroomen J, Driver J, de Gelder B: Is cross-modal integration of emotional expressions independent of attentional resources? Cogn Affect Behav Neuroscience. 2001, 1 (4): 382-387. 10.3758/CABN.1.4.382.
- Bertelson P, Radeau M: Cross-modal bias and perceptual fusion with auditory-visual spatial discordance. Percept Psychophys. 1981, 29 (6): 578-584.
- Driver J: Enhancement of selective listening by illusory mislocation of speech sounds due to lip-reading. Nature. 1996, 381 (6577): 66-68. 10.1038/381066a0.
- Spence C, Driver J: Attracting attention to the illusory location of a sound: reflexive crossmodal orienting and ventriloquism. Neuroreport. 2000, 11 (9): 2057-2061. 10.1097/00001756-200006260-00049.
- Wada Y, Kitagawa N, Noguchi K: Audio-visual integration in temporal perception. Int J Psychophysiol. 2003, 50 (1-2): 117-124. 10.1016/S0167-8760(03)00128-4.
- Shimojo S, Shams L: Sensory modalities are not separate modalities: plasticity and interactions. Curr Opin Neurobiol. 2001, 11 (4): 505-509. 10.1016/S0959-4388(00)00241-5.
- Bertelson P, Aschersleben G: Temporal ventriloquism: crossmodal interaction on the time dimension. 1. Evidence from auditory-visual temporal order judgment. Int J Psychophysiol. 2003, 50 (1-2): 147-155. 10.1016/S0167-8760(03)00130-2.
- Aschersleben G, Bertelson P: Temporal ventriloquism: crossmodal interaction on the time dimension. 2. Evidence from sensorimotor synchronization. Int J Psychophysiol. 2003, 50 (1-2): 157-163. 10.1016/S0167-8760(03)00131-4.
- de Gelder B, Pourtois G, Weiskrantz L: Fear recognition in the voice is modulated by unconsciously recognized facial expressions but not by unconsciously recognized affective pictures. Proc Natl Acad Sci USA. 2002, 99 (6): 4121-4126. 10.1073/pnas.062018499.
- Fort A, Giard MH: Multi electrophysiological mechanisms of audio-visual integration in human perception. The Handbook of Multisensory Processes. Edited by: Calvert GA, Spence C, Stein BE. 2004, Cambridge: MIT Press, 503-514.
- Calvert GA, Brammer MJ, Bullmore ET, Campbell R, Iversen SD, David AS: Response amplification in sensory-specific cortices during crossmodal binding. Neuroreport. 1999, 10 (12): 2619-2623. 10.1097/00001756-199908200-00033.
- Eimer M: Crossmodal links in spatial attention between vision, audition, and touch: evidence from event-related brain potentials. Neuropsychologia. 2001, 39 (12): 1292-1303. 10.1016/S0028-3932(01)00118-X.
- Calvert GA, Campbell R, Brammer MJ: Evidence from functional magnetic resonance imaging of crossmodal binding in the human heteromodal cortex. Curr Biol. 2000, 10 (11): 649-657. 10.1016/S0960-9822(00)00513-3.
- Fort A, Delpuech C, Pernier J, Giard MH: Early auditory-visual interactions in human cortex during nonredundant target identification. Brain Research Cogn Brain Research. 2002, 14 (1): 20-30. 10.1016/S0926-6410(02)00058-7.
- Teder-Salejarvi WA, McDonald JJ, Di Russo F, Hillyard SA: An analysis of audio-visual crossmodal integration by means of event-related potential (ERP) recordings. Brain Research Cogn Brain Research. 2002, 14 (1): 106-114. 10.1016/S0926-6410(02)00065-4.
- Molholm S, Ritter W, Murray MM, Javitt DC, Schroeder CE, Foxe JJ: Multisensory auditory-visual interactions during early sensory processing in humans: a high-density electrical mapping study. Brain Research Cogn Brain Research. 2002, 14 (1): 115-128. 10.1016/S0926-6410(02)00066-6.
- Molholm S, Ritter W, Javitt DC, Foxe JJ: Multisensory visual-auditory object recognition in humans: a high-density electrical mapping study. Cereb Cortex. 2004, 14 (4): 452-465. 10.1093/cercor/bhh007.
- McGurk H, MacDonald J: Hearing lips and seeing voices. Nature. 1976, 264 (5588): 746-748. 10.1038/264746a0.
- Besle J, Fort A, Delpuech C, Giard MH: Bimodal speech: early suppressive visual effects in human auditory cortex. Eur J Neuroscience. 2004, 20 (8): 2225-2234. 10.1111/j.1460-9568.2004.03670.x.
- Izumi A, Kojima S: Matching vocalizations to vocalizing faces in a chimpanzee (Pan troglodytes). Anim Cogn. 2004, 7 (3): 179-184. 10.1007/s10071-004-0212-4.
- Dolan RJ, Morris JS, de Gelder B: Crossmodal binding of fear in voice and face. Proc Natl Acad Sci USA. 2001, 98 (17): 10006-10010. 10.1073/pnas.171288598.
- Pourtois G, de Gelder B, Bol A, Crommelinck M: Perception of facial expressions and voices and of their combination in the human brain. Cortex. 2005, 41 (1): 49-59. 10.1016/S0010-9452(08)70177-1.
- Pourtois G, de Gelder B, Vroomen J, Rossion B, Crommelinck M: The time-course of intermodal binding between seeing and hearing affective information. Neuroreport. 2000, 11 (6): 1329-1333. 10.1097/00001756-200004270-00036.
- Parr LA: Perceptual biases for multimodal cues in chimpanzee (Pan troglodytes) affect recognition. Anim Cogn. 2004, 7 (3): 171-178. 10.1007/s10071-004-0207-1.
- De Gelder B, Vroomen J: The perception of emotions by ear and by eye. Cognition and Emotion. 2000, 14 (3): 289-311. 10.1080/026999300378824.
- Haxby JV, Hoffman EA, Gobbini MI: The distributed human neural system for face perception. Trends Cogn Sci. 2000, 4 (6): 223-233. 10.1016/S1364-6613(00)01482-0.
- Bentin S, Allison T, Puce A, Perez E, McCarthy G: Electrophysiological Studies of Face Perception in Humans. Journal of Cognitive Neuroscience. 1996, 8: 551-565. 10.1162/jocn.1996.8.6.551.
- George N, Evans J, Fiori N, Davidoff J, Renault B: Brain events related to normal and moderately scrambled faces. Cognitive Brain Research. 1996, 4: 65-76. 10.1016/0926-6410(95)00045-3.
- Itier RJ, Latinus M, Taylor MJ: Face, eye and object early processing: what is the face specificity? Neuroimage. 2006, 29 (2): 667-676. 10.1016/j.neuroimage.2005.07.041.
- Rossion B, Gauthier I, Tarr MJ, Despland P, Bruyer R, Linotte S, Crommelinck M: The N170 occipito-temporal component is delayed and enhanced to inverted faces but not to inverted objects: an electrophysiological account of face-specific processes in the human brain. Neuroreport. 2000, 11 (1): 69-74. 10.1097/00001756-200001170-00014.
- Puce A, Allison T, McCarthy G: Electrophysiological studies of human face perception. III: Effects of top-down processing on face-specific potentials. Cereb Cortex. 1999, 9 (5): 445-458. 10.1093/cercor/9.5.445.
- Severac Cauquil A, Edmonds GE, Taylor MJ: Is the face-sensitive N170 the only ERP not affected by selective attention? Neuroreport. 2000, 11 (10): 2167-2171.
- Taylor MJ, Edmonds GE, McCarthy G, Allison T: Eyes first! Eye processing develops before face processing in children. Neuroreport. 2001, 12 (8): 1671-1676. 10.1097/00001756-200106130-00031.
- Itier RJ, Taylor MJ: Effects of repetition and configural changes on the development of face recognition processes. Dev Sci. 2004, 7 (4): 469-487. 10.1111/j.1467-7687.2004.00367.x.
- Bedard C, Belin P: A "voice inversion effect?". Brain Cogn. 2004, 55 (2): 247-249. 10.1016/j.bandc.2004.02.008.
- Belin P, Fecteau S, Bedard C: Thinking the voice: neural correlates of voice perception. Trends Cogn Sci. 2004, 8 (3): 129-135. 10.1016/j.tics.2004.01.008.
- Levy DA, Granot R, Bentin S: Processing specificity for human voice stimuli: electrophysiological evidence. Neuroreport. 2001, 12 (12): 2653-2657. 10.1097/00001756-200108280-00013.
- Levy DA, Granot R, Bentin S: Neural sensitivity to human voices: ERP evidence of task and attentional influences. Psychophysiology. 2003, 40 (2): 291-305. 10.1111/1469-8986.00031.
- Charest I, Pernet CR, Rousselet GA, Quinones I, Latinus M, Fillion-Bilodeau S, Chartrand JP, Belin P: Electrophysiological evidence for an early processing of human voices. BMC Neurosci. 2009, 10: 127. 10.1186/1471-2202-10-127.
- Lattner S, Maess B, Wang Y, Schauer M, Alter K, Friederici AD: Dissociation of human and computer voices in the brain: evidence for a preattentive gestalt-like perception. Hum Brain Mapp. 2003, 20 (1): 13-21. 10.1002/hbm.10118.
- Joassin F, Maurage P, Bruyer R, Crommelinck M, Campanella S: When audition alters vision: an event-related potential study of the cross-modal interactions between faces and voices. Neuroscience Lett. 2004, 369 (2): 132-137. 10.1016/j.neulet.2004.07.067.
- Smith EL, Grabowecky M, Suzuki S: Auditory-visual crossmodal integration in perception of face gender. Curr Biol. 2007, 17 (19): 1680-1685. 10.1016/j.cub.2007.08.043.
- Andersen TS, Tiippana K, Sams M: Factors influencing audiovisual fission and fusion illusions. Brain Research Cogn Brain Research. 2004, 21 (3): 301-308. 10.1016/j.cogbrainres.2004.06.004.
- Reddy L, Wilken P, Koch C: Face-gender discrimination is possible in the near-absence of attention. J Vis. 2004, 4 (2): 106-117. 10.1167/4.2.4.
- Bindemann M, Burton AM, Hooge IT, Jenkins R, de Haan EH: Faces retain attention. Psychon Bull Rev. 2005, 12 (6): 1048-1053.
- Fort A, Delpuech C, Pernier J, Giard MH: Dynamics of cortico-subcortical cross-modal operations involved in audio-visual object detection in humans. Cereb Cortex. 2002, 12 (10): 1031-1039. 10.1093/cercor/12.10.1031.
- Degerman A, Rinne T, Pekkola J, Autti T, Jaaskelainen IP, Sams M, Alho K: Human brain activity associated with audiovisual perception and attention. Neuroimage. 2007, 34 (4): 1683-1691. 10.1016/j.neuroimage.2006.11.019.
- Vuilleumier P: Faces call for attention: evidence from patients with visual extinction. Neuropsychologia. 2000, 38 (5): 693-700. 10.1016/S0028-3932(99)00107-4.
- Shams L, Kamitani Y, Thompson S, Shimojo S: Sound alters visual evoked potentials in humans. Neuroreport. 2001, 12 (17): 3849-3852. 10.1097/00001756-200112040-00049.
- Talsma D, Doty TJ, Woldorff MG: Selective attention and audiovisual integration: is attending to both modalities a prerequisite for early integration? Cereb Cortex. 2007, 17 (3): 679-690. 10.1093/cercor/bhk016.
- Talsma D, Woldorff MG: Selective attention and multisensory integration: multiple phases of effects on the evoked brain activity. Journal of Cognitive Neuroscience. 2005, 17 (7): 1098-1114. 10.1162/0898929054475172.
- Rossion B, Campanella S, Gomez CM, Delinte A, Debatisse D, Liard L, Dubois S, Bruyer R, Crommelinck M, Guerit JM: Task modulation of brain activity related to familiar and unfamiliar face processing: an ERP study. Clin Neurophysiol. 1999, 110 (3): 449-462. 10.1016/S1388-2457(98)00037-6.
- Näätänen R, Picton T: The N1 wave of the human electric and magnetic response to sound: a review and an analysis of the component structure. Psychophysiology. 1987, 24 (4): 375-425. 10.1111/j.1469-8986.1987.tb00311.x.
- Alho K, Sams M, Paavilainen P, Naatanen R: Small pitch separation and the selective-attention effect on the ERP. Psychophysiology. 1986, 23 (2): 189-197. 10.1111/j.1469-8986.1986.tb00617.x.
- Jacques C, Rossion B: Concurrent processing reveals competition between visual representations of faces. Neuroreport. 2004, 15 (15): 2417-2421. 10.1097/00001756-200410250-00023.
- Michalewski HJ, Prasher DK, Starr A: Latency variability and temporal interrelationships of the auditory event-related potentials (N1, P2, N2, and P3) in normal subjects. Electroencephalogr Clin Neurophysiol. 1986, 65 (1): 59-71. 10.1016/0168-5597(86)90037-7.
- Amedi A, von Kriegstein K, van Atteveldt NM, Beauchamp MS, Naumer MJ: Functional imaging of human crossmodal identification and object recognition. Exp Brain Research. 2005, 166 (3-4): 559-571. 10.1007/s00221-005-2396-5.
- Callan DE, Jones JA, Munhall K, Callan AM, Kroos C, Vatikiotis-Bateson E: Neural processes underlying perceptual enhancement by visual speech gestures. Neuroreport. 2003, 14 (17): 2213-2218. 10.1097/00001756-200312020-00016.
- Calvert GA, Hansen PC, Iversen SD, Brammer MJ: Detection of audio-visual integration sites in humans by application of electrophysiological criteria to the BOLD effect. Neuroimage. 2001, 14 (2): 427-438. 10.1006/nimg.2001.0812.
- Sestieri C, Di Matteo R, Ferretti A, Del Gratta C, Caulo M, Tartaro A, Olivetti Belardinelli M, Romani GL: "What" versus "where" in the audiovisual domain: an fMRI study. Neuroimage. 2006, 33 (2): 672-680. 10.1016/j.neuroimage.2006.06.045.
- Talsma D, Kok A, Slagter HA, Cipriani G: Attentional orienting across the sensory modalities. Brain Cogn. 2008, 66 (1): 1-10. 10.1016/j.bandc.2007.04.005.
- Wagner AD: Working memory contributions to human learning and remembering. Neuron. 1999, 22 (1): 19-22. 10.1016/S0896-6273(00)80674-1.
- Poldrack RA, Wagner AD, Prull MW, Desmond JE, Glover GH, Gabrieli JD: Functional specialization for semantic and phonological processing in the left inferior prefrontal cortex. Neuroimage. 1999, 10 (1): 15-35. 10.1006/nimg.1999.0441.
- Bushara KO, Grafman J, Hallett M: Neural correlates of auditory-visual stimulus onset asynchrony detection. J Neuroscience. 2001, 21 (1): 300-304.
- Itier RJ, Taylor MJ: N170 or N1? Spatiotemporal differences between object and face processing using ERPs. Cereb Cortex. 2004, 14 (2): 132-142. 10.1093/cercor/bhg111.
- Belin P, Zatorre RJ, Lafaille P, Ahad P, Pike B: Voice-selective areas in human auditory cortex. Nature. 2000, 403 (6767): 309-312. 10.1038/35002078.
- Beauchemin M, De Beaumont L, Vannasing P, Turcotte A, Arcand C, Belin P, Lassonde M: Electrophysiological markers of voice familiarity. Eur J Neuroscience. 2006, 23 (11): 3081-3086. 10.1111/j.1460-9568.2006.04856.x.
- Campanella S, Belin P: Integrating face and voice in person perception. Trends Cogn Sci. 2007, 11 (12): 535-543. 10.1016/j.tics.2007.10.001.
- Jeffreys DA: The influence of stimulus orientation on the vertex positive scalp potential evoked by faces. Experimental Brain Research. 1993, 96 (1): 163-172.
- Picton TW, Bentin S, Berg P, Donchin E, Hillyard SA, Johnson R, Miller GA, Ritter W, Ruchkin DS, Rugg MD, et al.: Guidelines for using human event-related potentials to study cognition: recording standards and publication criteria. Psychophysiology. 2000, 37 (2): 127-152. 10.1017/S0048577200000305.
- McDonald JJ, Teder-Salejarvi WA, Hillyard SA: Involuntary orienting to sound improves visual perception. Nature. 2000, 407 (6806): 906-908. 10.1038/35038085.
- Berkovits I, Hancock GR, Nevitt J: Bootstrap resampling approaches for repeated measure designs: relative robustness to sphericity and normality violations. Educational and Psychological Measurement. 2000, 60 (6): 877-892. 10.1177/00131640021970961.
- Wilcox RR: Introduction to Robust Estimation and Hypothesis Testing. 2005, Second.
- Rousselet GA, Husk JS, Pernet CR, Gaspar CM, Bennett PJ, Sekuler AB: Age-related delay in information accrual for faces: evidence from a parametric, single-trial EEG approach. BMC Neurosci. 2009, 10: 114. 10.1186/1471-2202-10-114.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.