Acoustical Society of America and Acoustical Society of Japan Third Joint Meeting, Honolulu, Dec 2-6, 1996.



Tracking and Glimpsing Speech in Noise:
Role of Fundamental Frequency

Peter F. Assmann

School of Human Development, GR41
The University of Texas at Dallas
Box 830688, Richardson, Texas 75083-0688
Email: assmann@utdallas.edu
Web address: http://wwwpub.utdallas.edu/~assmann/

Abstract

Listeners with normal hearing can communicate successfully in noisy and reverberant conditions that distort temporal and spectral cues in the speech signal. Recent work has sought to identify auditory grouping processes that contribute to this ability. One factor that has received attention is the fundamental frequency of voicing (F0). During voiced speech the pulsing of the vocal folds gives rise to a pattern of amplitude modulation in the waveform and harmonicity in the spectrum. When two or more voices compete for the attention of the listener, momentary differences in F0 can contribute to voice segregation in at least three ways. First, periodicity or harmonicity in the composite signal provides a basis for grouping together signal components that stem from a target voice. Second, waveform interactions generate moment-to-moment fluctuations in signal-to-noise ratio that enable listeners to "glimpse" the acoustic features of the target voice. Third, time-varying changes in F0 provide a basis for tracking properties of the voice over time.

Introduction

In everyday life, listeners are often faced with the task of separating a voice from a background of competing voices. Speech has several design features that facilitate the process of attending selectively to a single talker. In this talk I will focus on one of these, the fundamental frequency of voicing. The fundamental frequency is determined by the rate of vibration of the vocal folds during voiced speech. It gives rise to periodicity in the waveform and harmonicity in the spectrum. These acoustic regularities are responsible for the perception of voice pitch, and contribute in several ways to the perceptual grouping of speech components.

Fundamental frequency is a robust feature of the speech signal. It provides a basis for grouping speech components across frequency and over time, to signal their origin in the same larynx and vocal tract. Fundamental frequency variation also underlies the prosodic structure of speech, and helps listeners to select between alternative interpretations of utterances when they are partially masked by other sounds.

Double vowels and "double sentences"

One source of evidence for the contribution of F0 comes from studies of the perception of double vowels. Scheffers (1983) showed that two vowels played simultaneously on different fundamentals were easier to identify than two vowels on the same F0. Scheffers' double vowel paradigm has been used extensively to study perceptual grouping and segregation, because it makes it possible to control the co-variation among the large number of acoustic parameters of natural speech. It is reasonable, therefore, to ask whether the perceptual processes revealed in studies of double vowels are representative of speech communication in general. One of my goals is to show that there are some interesting differences between the results obtained with double vowels and those obtained with double sentences. I'll review the major findings of these studies and then present some audio examples of double vowels and double sentences.

Slide 1: Double vowel identification results

(Assmann & Summerfield, 1994)

The first slide summarizes the results of a double vowel experiment by Assmann & Summerfield (1994). The stimuli were pairwise combinations of five American English vowels, /i/ ("EE"), /a/ ("AH"), /u/ ("OO"), /ae/ ("AE"), and /3^/ ("ER"). There are four main results:

Slide 2: Double sentence identification results

(Brokx & Nooteboom, 1982)

The second slide summarizes the findings of an experiment by Brokx and Nooteboom (1982) which used connected speech: double sentences rather than double vowels. Brokx and Nooteboom analyzed natural speech using linear predictive coding (LPC). Their LPC vocoder allowed them to artificially modify the characteristics of the excitation source to produce synthesized, monotone versions of the sentences. They then varied the difference in fundamental frequency between the target sentence and a continuous speech masker.

Identification accuracy was lowest when the target and masker had the same F0, and improved gradually as the difference in F0 increased. Accuracy dropped again when the two voices were exactly an octave apart, a condition where every harmonic of the higher-pitched voice coincides with a harmonic (every second one) of the lower-pitched voice. Unlike the function for double vowels, the identification function for double sentences did not flatten out between 1 and 2 semitones, but instead showed the largest increase between 2 and 3 semitones, where double vowels generally do not show any improvement.
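The octave case can be checked with a few lines of arithmetic. The following is a minimal sketch, not taken from Brokx and Nooteboom's study: the function names, the 100 Hz masker F0, the 1000 Hz upper limit, and the 0.5 Hz tolerance are all illustrative assumptions. It converts a semitone separation into an F0 and lists the harmonics of the higher voice that coincide with harmonics of the lower voice.

```python
def shift_semitones(f0_hz, semitones):
    """Return the F0 that lies `semitones` above f0_hz on the equal-tempered scale."""
    return f0_hz * 2.0 ** (semitones / 12.0)


def shared_harmonics(f0_a, f0_b, max_hz=1000.0, tol_hz=0.5):
    """List harmonics of voice A (up to max_hz) that fall within tol_hz
    of some harmonic of voice B. All parameter values are illustrative."""
    shared = []
    n = 1
    while n * f0_a <= max_hz:
        fa = n * f0_a
        m = round(fa / f0_b)  # index of the nearest harmonic of voice B
        if m >= 1 and abs(fa - m * f0_b) <= tol_hz:
            shared.append(fa)
        n += 1
    return shared


masker_f0 = 100.0  # hypothetical masker F0 in Hz
octave_target = shift_semitones(masker_f0, 12)  # one octave above the masker
minor_third_target = shift_semitones(masker_f0, 3)  # three semitones above

octave_overlap = shared_harmonics(octave_target, masker_f0)
minor_third_overlap = shared_harmonics(minor_third_target, masker_f0)
```

With these values every harmonic of the octave-shifted voice lands on a harmonic of the masker, whereas the 3-semitone shift produces no coincidences below 1000 Hz, which is one way to picture why the octave offers little basis for harmonic segregation.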

Slide 3: Double sentence identification results

(Bird & Darwin, 1994)

These results have recently been replicated and extended by Bird and Darwin, who used a vocoder technique to create monotone versions of short declarative sentences consisting of mainly voiced sounds. They presented the sentences in pairs, with one long masker sentence and a short target sentence in each pair. The results can be summarized as follows:

Why do we see these differences between double vowels and double sentences? Culling and Darwin (1994) suggested that part of the improvement in double-vowel identification might be attributed to waveform interactions rather than perceptual grouping based on F0. Waveform interactions play a role when corresponding resolved harmonics of the two vowels are close together in frequency. For example, when one F0 is 100 Hz and the other is 106 Hz, the fundamentals beat together to generate a pattern of rising and falling amplitude at a rate of 6 Hz (a period of about 167 ms). Their second harmonics beat together at a rate of 12 Hz, and so on. Culling and Darwin showed that the benefits of a difference in F0 are not eliminated when double vowels are synthesized with their even and odd harmonics on different F0s. They also showed that performance was unaffected by synthesizing all of the high-frequency harmonics of the two vowels on the same F0. Their results are consistent with the idea that the pattern of beating created by low-frequency resolved harmonics can highlight features of spectral shape that signal the presence of a particular vowel in the composite stimulus.
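The beating arithmetic in the 100 Hz and 106 Hz example can be written out directly. This is a minimal sketch under the simplifying assumption that each pair of corresponding harmonics consists of two equal-amplitude sinusoids; the function names are illustrative and are not part of Culling and Darwin's model.

```python
import math


def beat_rates(f0_a, f0_b, n_harmonics=3):
    """Beat rate in Hz between the n-th harmonics of two voices:
    n * |f0_a - f0_b| for n = 1 .. n_harmonics."""
    return [n * abs(f0_a - f0_b) for n in range(1, n_harmonics + 1)]


def beat_envelope(f_a, f_b, t_s):
    """Amplitude envelope at time t_s of cos(2*pi*f_a*t) + cos(2*pi*f_b*t),
    assuming equal amplitudes: 2 * |cos(pi * (f_a - f_b) * t)|."""
    return 2.0 * abs(math.cos(math.pi * (f_a - f_b) * t_s))


rates = beat_rates(100.0, 106.0)  # beat rates for harmonics 1, 2, 3
periods_ms = [1000.0 / r for r in rates]  # corresponding beat periods in ms
```

Under this assumption the envelope of the two fundamentals is at its maximum at time zero and falls to zero half a beat period later (about 83 ms), so a 200-ms stimulus spans a little more than one full beat cycle of the fundamentals.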

Slide 4: Double vowels: Repeated vs successive presentation

(Assmann & Summerfield, 1994)

Similar conclusions can be drawn from a double-vowel study we carried out to explore possible reasons why F0 provides smaller benefits when the stimuli are brief. In the condition labeled "Repeated" (green circles), we presented the same 50-ms segment four times in rapid succession. In the condition labeled "Successive" (red circles), we played four different 50-ms segments, extracted from a longer stimulus. Identification improved when listeners had the opportunity to listen to different segments, but not when they had repeated opportunities to listen to the same segment. The advantage appeared only when the F0 difference was less than 2 semitones, consistent with Culling and Darwin's waveform interaction hypothesis. The results provide further evidence that listeners do not process the stimulus as a single time snapshot, but instead perform a temporal analysis to determine where the spectra of the constituent vowels are best defined.

Culling and Darwin (1994) developed a model of double-vowel identification that performs a filter bank analysis followed by a brief, sliding temporal window. They developed a vowel classifier that uses a glimpsing strategy to scan for regions of the waveform where the spectral shapes of the two vowels are best defined. Their modeling work suggests that part of the improvement in identification accuracy is due to a glimpsing process that exploits the pattern of beating, rather than across-frequency grouping based on F0.
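The glimpsing idea can be illustrated with a toy classifier. The sketch below is not Culling and Darwin's implementation: it assumes the filter bank and temporal window have already produced a sequence of short-time spectral profiles (one list of channel levels per frame), and the four-channel templates and frame values are made-up numbers. The point it shows is only that each vowel is scored by its best-matching frame rather than by the average over the whole stimulus.

```python
import math


def correlation(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def glimpse_scores(frames, templates):
    """For each vowel template, return (best_score, best_frame_index),
    i.e. the single frame in which that vowel's spectral shape is best defined."""
    results = {}
    for name, template in templates.items():
        scores = [correlation(frame, template) for frame in frames]
        best = max(range(len(scores)), key=scores.__getitem__)
        results[name] = (scores[best], best)
    return results


# Hypothetical four-channel spectral templates for two vowels.
toy_templates = {"i": [1.0, 0.0, 0.0, 1.0], "a": [0.0, 1.0, 1.0, 0.0]}
# Hypothetical short-time profiles of a mixture: the first frame is
# ambiguous, the second favors /i/, and the third favors /a/.
toy_frames = [
    [0.5, 0.5, 0.5, 0.6],
    [1.0, 0.1, 0.0, 0.9],
    [0.1, 1.0, 0.9, 0.0],
]
toy_result = glimpse_scores(toy_frames, toy_templates)
```

In this toy case the two vowels are best defined in different frames, which is the situation a beating pattern would be expected to create; averaging the three frames would blur both spectral shapes.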

Slide 5: Filter bank analysis of two 50-ms segments of a double vowel

This slide shows waterfall plots of the output of a gammatone filter bank (Patterson et al., 1992) in response to two different 50-ms segments of the vowel pair /i/ (F0=100 Hz) plus /a/ (F0=103 Hz). The arrows on the left show the center frequencies of the formants (F1, F2, F3) of the /i/; the arrows on the right show the formant frequencies of /a/. The initial 50-ms segment of the stimulus is shown in the upper display; the 150-200 ms segment is shown at the bottom. These two segments differed mainly in terms of the identifiability of the /i/ component. Pitch-pulse asynchrony in the high-frequency region and the pattern of beating between pairs of resolved harmonics of the two vowels in the low-frequency region provide two sources of evidence that may contribute to the identification of the vowels. It is probable that the low-frequency cues make a larger contribution.
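For readers unfamiliar with the filter bank, a single gammatone channel can be sketched from its impulse response. The parameter choices below (fourth-order filter, bandwidth of 1.019 times the equivalent rectangular bandwidth, and the ERB formula 24.7(4.37F/1000 + 1)) are a commonly used parameterization; that the analysis on the slide used exactly these values is an assumption, and the function names are illustrative.

```python
import math


def erb_hz(center_hz):
    """Equivalent rectangular bandwidth (Hz) of the auditory filter
    centered at center_hz, using the formula 24.7 * (4.37 * F / 1000 + 1)."""
    return 24.7 * (4.37 * center_hz / 1000.0 + 1.0)


def gammatone_impulse_response(center_hz, fs_hz, duration_s=0.025, order=4):
    """Sampled gammatone impulse response, normalized to a peak magnitude of 1:
    t**(order-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t), with b = 1.019 * ERB."""
    bandwidth = 1.019 * erb_hz(center_hz)
    n_samples = int(round(duration_s * fs_hz))
    response = []
    for n in range(n_samples):
        t = n / fs_hz
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * bandwidth * t)
        response.append(envelope * math.cos(2.0 * math.pi * center_hz * t))
    peak = max(abs(v) for v in response)
    return [v / peak for v in response]


# One hypothetical channel centered at 1000 Hz, sampled at 16 kHz.
channel_ir = gammatone_impulse_response(1000.0, 16000)
```

Convolving a stimulus with a set of such responses at different center frequencies gives the multi-channel output that a waterfall plot displays; because the bandwidth grows with center frequency, low-numbered harmonics are resolved into separate channels while high harmonics interact within a channel.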

We have recently extended Culling and Darwin's model to show that it can account for the effects of adjacent formant transitions on the identification of double vowels embedded in CVC syllables (Assmann, 1996). However, the model underestimates the contribution of F0 when the difference is 2 semitones or greater. We therefore combined the glimpsing strategy for vowel identification with the autocorrelation model of perceptual grouping proposed by Meddis and Hewitt (1992). This hybrid model predicts the pattern of listeners' responses quite well, suggesting that at least two distinct mechanisms contribute to the identification of double vowels. One is sensitive to waveform interactions. This mechanism operates mainly when the F0 difference is 1 semitone or less. The second mechanism involves a form of perceptual grouping that exploits across-channel comparisons of the pattern of periodicity in the waveform.
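The periodicity analysis that underlies autocorrelation models can be illustrated on a single waveform. The sketch below is a deliberate simplification and is not the Meddis and Hewitt model, which computes autocorrelation functions separately in each channel of a filter bank and compares them across channels. Here one synthetic harmonic complex is analyzed directly; the sampling rate, duration, number of harmonics, and F0 search range are illustrative assumptions.

```python
import math


def harmonic_complex(f0_hz, n_harmonics, duration_s, fs_hz):
    """Sum of equal-amplitude cosine harmonics of f0_hz, as a list of samples."""
    n_samples = int(round(duration_s * fs_hz))
    return [
        sum(math.cos(2.0 * math.pi * k * f0_hz * n / fs_hz)
            for k in range(1, n_harmonics + 1))
        for n in range(n_samples)
    ]


def autocorrelation_f0(signal, fs_hz, f0_min_hz=60.0, f0_max_hz=400.0):
    """Estimate F0 as fs / lag, where lag maximizes the mean lagged product
    within the lag range implied by the F0 search limits."""
    min_lag = int(fs_hz / f0_max_hz)
    max_lag = int(fs_hz / f0_min_hz)
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        n = len(signal) - lag
        value = sum(signal[i] * signal[i + lag] for i in range(n)) / n
        if value > best_value:
            best_lag, best_value = lag, value
    return fs_hz / best_lag


# A 50-ms complex of five harmonics on a 100 Hz fundamental, sampled at 8 kHz.
test_signal = harmonic_complex(100.0, 5, 0.05, 8000)
estimated_f0 = autocorrelation_f0(test_signal, 8000)
```

The autocorrelation peaks at the lag equal to one period of the waveform (80 samples here). In a multi-channel model, channels whose autocorrelation functions peak at the same lag can be assigned to the same voice, which is the sense in which periodicity supports grouping across frequency.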

Pitch and vowel identification

Double vowels often give the subjective impression that two (synthetic) voices are producing different vowels on different pitches (listen to demo #1 for an example). Dwayne Paschall and I have recently carried out a study of the pitches evoked by double vowels.

Slide 6: Pitch matching results for 200 ms double vowels

(Assmann & Paschall, 1996)

This slide shows aggregate histograms of pitch matches from five listeners and 25 vowel pairs. The arrows show the frequencies of the two F0s that were present. Listeners assigned two matches to every stimulus. Matches to the dominant pitch are shown with the red bars; non-dominant pitch matches are shown with blue bars. We found that double vowels with small F0 differences evoke a single pitch and produce unimodal matching histograms. Vowel pairs with an F0 difference of four semitones (bottom panel) evoked two clear pitches and resulted in bimodal histograms. For intermediate F0 differences the histograms are more variable, forming a broader distribution that favors the higher F0. One interesting observation is that the dominant pitch is consistently associated with the higher F0. Compared to the 200-ms stimuli, we found that 50-ms double vowels generate weaker pitches, and do not show the dominance of the higher fundamental.

Slide 7: Correlation of pitch intervals and identification accuracy

Double vowels with small differences in F0 generally do not evoke clear pitches. This outcome is consistent with the idea that waveform interactions, rather than F0-guided segregation, may underlie the F0 effect for vowel identification. But what about larger differences in F0?

This slide shows the relationship between pitch judgments and vowel identification. The bars in the slide show the average frequency distance between the two pitch matches assigned to each stimulus. Double vowels that evoke two pitches should produce large frequency intervals, while stimuli that evoke a single pitch should produce intervals close to zero. Pitch intervals (Hz) are referenced to the right-hand axis. Vowel identification results are shown by the filled circles, and are referenced to the left-hand axis.

At first glance, the pitch and vowel identification functions are similar, in that they both increase as a function of the difference in F0. However, the vowel identification scores increase sharply over the first semitone, where the pitch intervals increase only by a small amount. Second, vowel identification scores reach a plateau between 1 and 2 semitones, while pitch intervals continue to increase up to 4 semitones.

The correlation between pitch intervals and vowel identification accuracy is shown above each bar. For small differences in F0, pitch and vowel identification were uncorrelated. However, significant correlations appeared for the larger F0 differences of 2 and 4 semitones. The relationship between vowel identification and pitch emerges in conditions where listeners hear two distinct pitches rather than one. In addition, we found that the decline in accuracy when the stimulus duration was shortened from 200 ms to 50 ms was significantly correlated with the reduction in pitch interval.

Glimpsing and grouping?

I have described two ways that fundamental frequency can contribute to the perceptual segregation of competing voices. One is based on waveform interactions and involves a glimpsing mechanism. It contributes mainly when the F0 separation is 1 semitone or less. The other is sensitive to harmonicity or periodicity in the signal, and contributes when the F0 difference is large enough to evoke the percept of two distinct pitches.

Bird and Darwin's results suggest that waveform interactions do not contribute substantially in double sentences. In their review of auditory grouping, Darwin and Carlyon (1995) proposed that the contribution of waveform interactions in double vowels is enhanced by the long stimulus duration (200 ms is longer than most vowels in natural speech), and possibly by their steady-state nature. We have presented additional evidence consistent with this view, but suggest that F0-based grouping may contribute to double-vowel identification when the difference in F0 is two semitones or greater.

Up to this point I have focused on differences in F0, but F0 is rarely held constant in natural speech. Fundamental frequency variation probably contributes to voice separation in several ways. For example:

Slide 8: Waveform, spectrogram, & F0s of mixed sentences

This slide shows the waveform, spectrogram and fundamental frequencies of two sentences that are presented concurrently. In this example, the fundamental frequencies of the two sentences occupy a similar range, and cross in several places. As in the audio examples below, they were both produced by the same (male) talker. The two sentences have similar syntactic and prosodic structure. Brokx and Nooteboom (1982) noted that overlap in the F0 trajectory can lead to perceptual fusion and impaired identification.

Audio Demos

The following examples of double vowels and double sentences are included to illustrate the role of F0 in the perceptual separation of competing voices.

1. Double vowels

Notice the tendency for the vowels to fuse when they have the same fundamental, and to appear as two separate voices when the F0 difference is large.

2. Double sentences, monotone pitch

Next we hear examples of synthesized sentences on a monotone pitch, similar to those used in the double-sentence studies. The sentences were created with a channel vocoder based on a gammatone filter bank.

3. Double sentences, natural pitch

Now we hear examples of synthesized sentences that preserve the natural F0 contour. The pitch of one of the sentences is scaled upwards along the semitone scale, but the natural variation with time has been preserved.
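Scaling a pitch contour along the semitone scale amounts to multiplying every voiced F0 value by a constant ratio. The following is a minimal sketch of that step only, not of the vocoder used for the demos; the function names, the use of None for unvoiced frames, and the contour values are illustrative assumptions.

```python
import math


def scale_contour(f0_contour_hz, semitones):
    """Shift an F0 contour by a fixed number of semitones.
    Unvoiced frames, marked here with None, are left unchanged."""
    ratio = 2.0 ** (semitones / 12.0)
    return [None if f0 is None else f0 * ratio for f0 in f0_contour_hz]


def interval_semitones(f_low_hz, f_high_hz):
    """Musical interval between two frequencies, in semitones."""
    return 12.0 * math.log2(f_high_hz / f_low_hz)


# Hypothetical frame-by-frame contour in Hz; None marks an unvoiced frame.
original = [110.0, 125.0, None, 118.0, 96.0]
shifted = scale_contour(original, 4)  # raise the whole contour by 4 semitones
```

Because the shift is multiplicative, the interval between any two voiced frames of the shifted contour is the same as in the original, which is what it means for the natural variation over time to be preserved.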

4. Double sentences, noise excitation

In the final example the excitation source is changed from glottal pulsing to white noise. These sentences sound like whispered versions of the originals. The mixture of two unvoiced sentences is hard to understand, illustrating the importance of the harmonicity and periodicity that accompanies voiced speech.

Summary and conclusions

Fundamental frequency contributes in several ways to the perceptual segregation of competing voices. We have seen that double vowels and double sentences, when used as stimuli for studying the perceptual grouping of speech, may engage different mechanisms, one based on grouping, the other based on waveform interactions and glimpsing. While it is possible that waveform interactions contribute only under artificial laboratory conditions, it is likely that glimpsing processes have a more general role in the separation of speech from noise. Finally, the audio examples show that double sentences that retain natural variations in fundamental frequency provide additional information that cannot be investigated with simpler stimuli such as double vowels or even monotone double sentences.

References

Assmann, P.F. and Paschall, D.D. (1996). Pitches of concurrent vowels. Submitted to J. Acoust. Soc. Am.

Assmann, P.F. (1996). Modeling the perception of concurrent vowels: Role of formant transitions. J. Acoust. Soc. Am. 100: 1141-1152.

Assmann, P.F. (1995). The role of formant transitions in the perception of concurrent vowels. J. Acoust. Soc. Am. 97: 575-584.

Assmann, P.F. and Summerfield, Q. (1994). The contribution of waveform interactions to the perception of concurrent vowels. J. Acoust. Soc. Am. 95: 471-484.

Bregman, A.S. (1990). Auditory scene analysis. (MIT Press, Cambridge, MA).

Brokx, J.P.L. and Nooteboom, S.G. (1982). Intonation and the perception of simultaneous voices. Journal of Phonetics 10: 23-26.

Culling, J.F. and Darwin, C.J. (1994). Perceptual and computational separation of simultaneous vowels: cues from low-frequency beating. J. Acoust. Soc. Am. 95: 1559-1569.

Darwin, C.J. and Carlyon, R.P. (1995). Auditory grouping. In Handbook of Perception and Cognition, Volume 6: Hearing, edited by B.C.J. Moore (Academic, London).

de Cheveigné, A., Kawahara, H., Tsuzaki, M., and Aikawa, K. (1996). Concurrent vowel identification I: effects of relative level and F0 difference. J. Acoust. Soc. Am., in press.

McKeown, J.D. and Patterson, R.D. (1995). The time course of auditory segregation: Concurrent vowels that vary in duration. J. Acoust. Soc. Am. 98: 1866-1877.

Meddis, R. and Hewitt, M. (1992). Modelling the identification of concurrent vowels with different fundamental frequencies. J. Acoust. Soc. Am. 91: 233-245.

Patterson, R.D., Robinson, K., Holdsworth, J., McKeown, D., Zhang, C. and Allerhand, M.H. (1992). Complex sounds and auditory images. In Auditory Physiology and Perception, edited by Y. Cazals, L. Demany, & K. Horner, (Pergamon Press, Oxford), pp. 429-446.

Scheffers, M.T.M. (1983). Sifting Vowels: Auditory Pitch Analysis and Sound Segregation. Ph.D. Thesis (Rijksuniversiteit te Groningen, The Netherlands).
