By 2050, hearing loss is expected to affect 900 million people worldwide [1]. The cochlear implant (CI) is one of the most successful prostheses [2]. To date, more than 700,000 patients globally have been fitted with CIs; most of these patients communicate fluently under clear conditions [3,4,5]. Although CIs allow users to understand up to 90% of all words in sentences spoken in clear environments, further challenges are encountered in noisy environments [3]. In CIs, the only signals transmitted are the temporal envelope cues of various frequency regions; the temporal fine structure (TFS) cues of the original acoustic signals are discarded. Many scholars have suggested that the lack of TFS cues partly explains the hearing difficulties experienced in noisy environments [6,7,8,9].
China accounts for approximately 20% of the world’s population and the socioeconomic burdens of hearing loss in China are immense [10]. By a conservative extrapolation, there is an estimated annual demand of 100,000 CIs in China [11]. The widespread use of CIs, which transmit only temporal envelope cues, by Chinese speakers also sparks a theoretical interest in the contribution of temporal envelope cues across frequency regions to Mandarin perception, which, unlike English, is a tonal language. Therefore, in this study, we used temporal envelope cues under noisy conditions to focus on the perception strategies adopted by Chinese speakers for Mandarin perception. This was done with the ultimate goal of developing optimal CIs for Chinese-speaking CI users.
To simulate the stimulation pattern of CIs, Shannon et al. divided the frequency spectrum into continuous broad-frequency bands (i.e., analysis filters) and then extracted the temporal envelope cues from different frequency bands to modulate noises of the same bandwidths [12]. The recognition performance increased with the number of bands [12]. The number of frequency bands needed for good speech recognition increased with the increasing difficulty of the listening situation [13, 14]. Some researchers allocated different frequency bands to different frequency regions, each containing several continuous-frequency bands [12, 15,16,17,18]; they found that temporal cues delivered at various frequencies contribute unequally to speech intelligibility [16,17,18,19,20,21,22,23,24,25]. The different frequency regions were presented to listeners to acquire recognition accuracies. The relative weights of temporal cues from various frequency regions could be calculated by permutation and combination of temporal information in different frequency regions. In this study, the frequency-weighting function of the temporal envelope was used to indicate the relative weights of the temporal envelope in different frequency regions [21, 25, 26].
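The band-splitting and envelope-modulation steps described above can be sketched roughly as follows. This is a minimal illustration, not the processing used in [12]: the brick-wall FFT filters, the half-wave rectification, the 50 Hz envelope cutoff, and the names `band_limit` and `noise_vocode` are all assumptions made for the example.

```python
import numpy as np

def band_limit(x, fs, lo, hi):
    """Zero all FFT bins outside [lo, hi) Hz (idealised brick-wall filter, an assumption)."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spec[(freqs < lo) | (freqs >= hi)] = 0.0
    return np.fft.irfft(spec, n=len(x))

def noise_vocode(x, fs, edges, env_cutoff=50.0, seed=0):
    """Replace each analysis band of x by band-limited noise carrying that band's envelope."""
    rng = np.random.default_rng(seed)
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = band_limit(x, fs, lo, hi)
        # Envelope: half-wave rectify, then low-pass (cutoff is a hypothetical choice).
        env = band_limit(np.maximum(band, 0.0), fs, 0.0, env_cutoff)
        env = np.maximum(env, 0.0)
        # Carrier: white noise restricted to the same band as the analysis filter.
        carrier = band_limit(rng.standard_normal(len(x)), fs, lo, hi)
        # Re-filter so modulation sidebands do not spill into neighbouring bands.
        out += band_limit(env * carrier, fs, lo, hi)
    return out
```

Passing more entries in `edges` corresponds to a larger number of bands; dropping a band from the sum is one simple way to emulate a spectral hole.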
Ardoint et al. extracted temporal envelope cues from 15 frequency bands across 70–7313 Hz and divided them into five regions. Consonant identification scores were obtained by presenting normal-hearing listeners with envelope cues from a single region and pairs of regions under clear conditions. The results suggested that temporal envelopes in the high-frequency region (1.8–7.3 kHz) contributed more than those of other regions toward English consonant recognition under clear listening conditions [17]. In contrast, the “hole” method (i.e., the spectral removal method [16]) has also been used to study the weighting function of the temporal envelope in various frequency regions. Shannon et al. eliminated the information in low-, middle-, or high-frequency regions to simulate holes in the apical, middle, or basal regions of the cochlea. Recognition results suggested that the hole in the apical region (i.e., loss of temporal envelope cues in the low-frequency region) was more damaging than holes in the middle or basal regions [16]. These conflicting observations might result from the different spectra, cutoff frequency allocations, and methods used for extracting the envelope. In addition, Shannon et al. only investigated the effect of a single hole in the spectrum, which did not take into account the combined effects of nonadjacent frequency regions, and the negative effect of the hole was not obvious when the hole was relatively small [16].
Kasturi et al. modified the setting of hole conditions in their study, considering the possibility that listeners could combine speech cues from nonadjacent frequency regions [18]. The speech materials spanning the frequency range from 300 to 5500 Hz were filtered into six frequency regions in a logarithmic fashion. The hole in the frequency spectrum was created by removing the information cues in one or two frequency regions. The intelligibility of speech with a single hole in different regions, or with two holes in disjoint or adjacent regions in the spectrum, was assessed. The intelligibility of speech without holes was also obtained as a baseline. The frequency-weighting functions were then derived based on a least-squares approach, which suggested that all frequency ranges contributed equally to consonant identification, whereas the frequency regions located at 300–487, 791–1284, and 1284–2085 Hz received the largest weights for vowel identification [18].
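As a rough illustration of the two computations in this paragraph, the sketch below spaces six regions logarithmically between 300 and 5500 Hz and fits relative weights by least squares. The linear model (score approximated by a weighted sum of the regions present), the helper names, and the example weights are assumptions for illustration only; they are not taken from [18].

```python
from itertools import combinations
import numpy as np

def log_edges(f_lo, f_hi, n_regions):
    """Region boundaries with a constant frequency ratio between neighbours."""
    ratio = (f_hi / f_lo) ** (1.0 / n_regions)
    return [f_lo * ratio ** k for k in range(n_regions + 1)]

def hole_conditions(n_regions, max_holes=2):
    """One row per condition: 1.0 if a region is present, 0.0 if it is a hole."""
    rows = []
    for n_holes in range(max_holes + 1):
        for holes in combinations(range(n_regions), n_holes):
            rows.append([0.0 if r in holes else 1.0 for r in range(n_regions)])
    return np.array(rows)

def relative_weights(presence, scores):
    """Least-squares fit of scores ~ presence @ w, normalised so the weights sum to 1."""
    raw, *_ = np.linalg.lstsq(presence, scores, rcond=None)
    return raw / raw.sum()

edges = log_edges(300.0, 5500.0, 6)
presence = hole_conditions(6)            # no hole, 6 single holes, 15 double holes
made_up = np.array([0.3, 0.1, 0.2, 0.2, 0.1, 0.1])  # hypothetical weights, not data
scores = presence @ (100.0 * made_up)    # synthetic noise-free percent-correct scores
weights = relative_weights(presence, scores)
```

Truncated to whole hertz, the interior edges reproduce the 487, 791, 1284, and 2085 Hz boundaries quoted above. With real listener scores the fit would not be exact, and a different link between score and region presence may be more appropriate.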
In contrast to English, which is a non-tonal language, Mandarin Chinese is a tonal language. This means that lexical tones are essential features of the language, and changing pitches are associated with different meanings [27, 28]. There are four distinctive tone patterns in Mandarin Chinese, and these are characterized by the syllable-level fundamental frequency (F0) contours: high tone (Tone 1), rising tone (Tone 2), dipping tone (Tone 3), and falling tone (Tone 4) [29]. For instance, the Mandarin Chinese syllable /ma/ has four different tones, where the numbers represent tone height: mā (Tone 1, high, 55; e.g., “mother”), má (Tone 2, rising, 35; e.g., “hemp”), mǎ (Tone 3, dipping, 214; e.g., “horse”), and mà (Tone 4, falling, 51; e.g., “scold”). It is well acknowledged that lexical tone plays a major role in the understanding of Mandarin speech [30,31,32,33]. Fu et al. found that tone, vowel, and consonant recognition contributed equally to Chinese sentence recognition [27]. Incorrect tone negatively influenced Mandarin sentence recognition in ways similar to misplaced or missing consonants and vowels in sentences [31, 33].
Recently, we studied the frequency-weighting functions of temporal envelope cues for Mandarin sentence recognition in a clear environment [25]. The temporal envelope cues of the original sentences were extracted across 80–7562 Hz and then distributed into five frequency regions. The relative temporal envelope weights of the different regions were calculated after measuring the recognition scores under various conditions with different combinations of envelopes in different frequency regions. We found that temporal envelope cues in Region 1 (80–502 Hz) were of higher weight than those in any other region for Mandarin sentence perception [25], which differs from the pattern reported for English. This may be because Mandarin is a tonal language with different tones that convey different meanings [25]. Lexical tone recognition is crucial to Mandarin sentence perception, and F0 plays an essential role in tone perception. Therefore, it is logical that Region 1 should exhibit a high relative weight in terms of Mandarin sentence perception [34,35,36,37]. However, the perceptual weighting strategy may differ depending on the listening environment.
Under clear listening conditions, the acoustic cues of speech are typically abundant and conducive to successful recognition. However, CI users encounter difficulties under noisy conditions [3]; this is a problem because most conversations in the real world occur in noisy environments. Several studies have addressed the perceptual weight shifts of envelope cues across various frequency regions for English speech recognition in noise. However, no research to date has focused on the change of perceptual weights for Mandarin Chinese in noisy environments.
Speech-shaped noise (SSN) that matches the long-term average spectrum of recorded speech material is frequently applied in tests investigating the relative weights of temporal cues from various frequency regions [9, 19, 38]. This ensures that the signal-to-noise ratios (SNRs) are approximately equal at all frequencies [39, 40]. Using both the hole method and correlational method [41], Apoux and Bacon studied the relative temporal envelope weights of four frequency regions in SSN [19]. Under clear listening conditions, the hole method showed that the temporal envelope cues of all regions contributed similarly to consonant identification. However, under noisy conditions, both the hole method and correlational method indicated that the temporal envelope cues in the highest frequency region had the greatest importance [19]. Although low-rate syllabic modulations (< 4 Hz) are present across the frequency spectrum, mid- and/or high-frequency modulations (> 10 Hz) might carry unique speech information specific to the high-frequency regions [9, 19]. The shapes of the modulation spectra in adjacent frequency regions might explain this weight shift observed by Apoux and Bacon [19].
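One common way to obtain noise with the same spectrum as a speech recording, and to mix it at a chosen SNR, is sketched below. Phase randomization is an assumption here, since the cited studies may have generated SSN differently, and the sketch shapes the noise from a single signal rather than from a long-term average over a corpus. The function names are hypothetical.

```python
import numpy as np

def speech_shaped_noise(speech, seed=0):
    """Noise with the magnitude spectrum of `speech` and uniformly random phases."""
    rng = np.random.default_rng(seed)
    mag = np.abs(np.fft.rfft(speech))
    phase = rng.uniform(0.0, 2.0 * np.pi, size=mag.shape)
    return np.fft.irfft(mag * np.exp(1j * phase), n=len(speech))

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the overall speech-to-noise power ratio equals snr_db, then add."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

Because the noise and the speech share one magnitude spectrum, scaling the noise by a single gain changes the SNR by the same amount in every frequency region, which is the property the paragraph above relies on.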
In addition, most realistic noises are modulated or fluctuating in level; therefore, fluctuating background noises (i.e., amplitude-modulated noise) are widely used in perception experiments [15, 42,43,44,45]. Amplitude modulation was found to interfere with the perception of temporal envelope cues, especially at low modulation rates [15, 46]. Fogerty also found that listeners placed higher perceptual weight on temporal envelope cues in the high-frequency region if speech was interrupted by noise at either a syllabic rate (4 Hz) or periodic rate (128 Hz) [9]. Thus, listeners appear to adapt their perceptual strategies, namely frequency-weighting functions, when communicating in adverse environments (i.e., those with noise) [9, 38]. Although there is evidence that white noise can severely impair speech perception [47, 48], no study has focused on the impact of white noise on the relative weights of the temporal envelope from different frequency regions.
Investigating the perception strategy using envelope cues has important implications because the number of CI users who speak Chinese is growing rapidly, and CIs primarily convey envelope cues. Given the tonal character of Mandarin and the essential role of F0 in lexical tone recognition, we expected temporal envelope cues from the low-frequency region, which contains F0 (typically approximately 100 to 350 Hz for Mandarin lexical tones) [32, 49, 50], to be more important for Mandarin sentence recognition under noisy conditions than under clear listening conditions. Furthermore, we hypothesized that the weight of the low-frequency region would differ across types of noise. In this study, we tested these hypotheses by changing the number and location of holes in the spectrum. Then, we adopted a least-squares approach to determine the relative weights of temporal envelope cues across frequency regions in different noisy environments.