Perception Lecture Notes: Cochlear Implants and Speech Perception

Professor David Heeger

What you should know from this lecture

There is a lot more to know about speech perception and language understanding than we can cover here. Some of it is reviewed in your textbook (which you should read). We don't have time to cover all of that material; we could devote an entire course to this topic.

Speech analysis. To understand something about speech perception, one must begin with the elements of speech itself. Spoken language consists of a sequence of discrete units. The largest unit we will consider is the word, which you can think of as the thing you write surrounded by spaces. However, words can consist of more than one meaning-bearing segment. For example, "childhood" consists of "child" (young person) plus "hood" (the state of being). These subunits are called morphemes. Next, each morpheme can be split into a sequence of sounds, called phonemes. A phoneme is the smallest unit of sound that, if changed, can change a word's meaning. For example, the "i" sound in "hit" is a phoneme, because if you change it to "a" you get "hat". Finally, phonemes correspond to produced speech sounds (although the correspondence is complicated, and is studied by phonologists and phoneticians). Phonemes are characterized primarily by the manner in which the sounds are produced.
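To make this hierarchy concrete, here is a minimal sketch (not from the lecture) that represents "childhood" as a word, its morphemes, and their phonemes, and checks the "hit"/"hat" minimal pair. The ARPAbet-style phoneme symbols and the particular transcriptions are my own illustrative choices.

```python
# A sketch of the word -> morpheme -> phoneme hierarchy. The morpheme split
# and the ARPAbet-style transcriptions are illustrative, not authoritative.

word = "childhood"

# Morphemes: the meaning-bearing subunits of the word.
morphemes = ["child", "hood"]

# Phonemes for each morpheme (ARPAbet-style symbols).
phonemes = {
    "child": ["CH", "AY", "L", "D"],
    "hood":  ["HH", "UH", "D"],
}

# Minimal pair: changing a single phoneme changes the word's meaning.
hit = ["HH", "IH", "T"]
hat = ["HH", "AE", "T"]
differ = [i for i, (a, b) in enumerate(zip(hit, hat)) if a != b]
print("hit and hat differ only at phoneme position:", differ)  # [1]
```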

The primary distinction is between vowels (produced with no air constriction) and consonants (which involve a partial or total constriction of air flow). All vowels are voiced, meaning that the vocal cords vibrate while producing the sound, so that the sound has a well-defined pitch and can be sung. Some consonants are voiced as well.

Vowels are described based on two main distinctions. First, one specifies the position of the highest part of the tongue used in pronouncing the particular vowel sound (high, mid, or low in the mouth; front, near the teeth, or back). Second, the vowel sound may require the mouth to be wide open, or the lips to be closed and rounded (compare how you say "ah" vs. "oh"). Finally, some vowel sounds involve a pair of vowels pronounced smoothly in succession, known as a diphthong (e.g., the "ou" sound in "house", which consists of "ah" followed by a long "oo").

There are many different kinds of consonants, depending on the way in which they are articulated. Stop consonants (or plosives) involve a complete closure of the air stream, followed by an explosive release of air. These include the English consonants p/b, t/d, and k/hard-g. Each pair in that list differs from the next in the place of articulation, i.e., where the air constriction occurs (between the lips, tongue to teeth, or tongue to soft palate). The two consonants in each pair differ in their voicing (b/d/g are voiced: the vocal cords vibrate as soon as the air is released). Another major class of consonants includes the fricatives and sibilants. These involve an incomplete closure of the air stream, resulting in a noisy rush of air through a small opening. Included are f/v, s/z, and sh/zh ("sh" as in hush vs. "zh" as in azure). Again, the two sounds in each pair differ as to whether they are voiced (v/z/zh) or not. The voiced fricatives/sibilants can be sung, as they have a well-defined pitch set by the vibrations of the vocal cords. Other consonant sounds include the laterals (l/r), glides (y/w), and nasals (m/n).
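One way to see how these articulatory distinctions organize the consonants is to lay them out as features. The little table below is only a sketch; the dictionary layout and the place-of-articulation labels are my own shorthand for the distinctions described above, not standard phonetic terminology.

```python
# A sketch of consonant features: (manner, place of articulation, voiced?).
# The labels are informal shorthand for the distinctions in the lecture.
consonants = {
    "p": ("stop",      "lips",         False),
    "b": ("stop",      "lips",         True),
    "t": ("stop",      "tongue-teeth", False),
    "d": ("stop",      "tongue-teeth", True),
    "k": ("stop",      "soft palate",  False),
    "g": ("stop",      "soft palate",  True),
    "f": ("fricative", "lip-teeth",    False),
    "v": ("fricative", "lip-teeth",    True),
    "s": ("sibilant",  "tongue-teeth", False),
    "z": ("sibilant",  "tongue-teeth", True),
}

# Each unvoiced/voiced pair shares manner and place, differing only in voicing.
for unvoiced, voiced in [("p", "b"), ("t", "d"), ("k", "g"), ("f", "v"), ("s", "z")]:
    assert consonants[unvoiced][:2] == consonants[voiced][:2]
    print(f"{unvoiced}/{voiced}: same manner and place, differ only in voicing")
```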

A spectrogram is a graph of frequency vs. time. Each row is a frequency band; the intensity displayed at any point in a sound spectrogram indicates the amplitude of that frequency band (specified by the vertical position) at that time (specified by the horizontal position). A spectrogram is computed using the Fourier transform, which is very similar to the processing done by the cochlea. Many of the phonetic distinctions discussed above are easily visible in a sound spectrogram. People have even attempted to train themselves to "read" the speech in spectrograms, although this turns out to be an extraordinarily difficult task.
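Here is a minimal sketch of how a spectrogram is computed with a short-time Fourier transform, using SciPy. The stand-in signal, window length, and overlap are my own illustrative choices, not values from the lecture.

```python
# A sketch of computing a spectrogram via the short-time Fourier transform.
import numpy as np
from scipy.signal import spectrogram

fs = 16000                        # sampling rate (Hz), illustrative
t = np.arange(0, 1.0, 1 / fs)     # 1 second of "audio"
# Stand-in for speech: a 120 Hz fundamental plus a few harmonics.
x = sum(np.sin(2 * np.pi * 120 * k * t) / k for k in range(1, 6))

# Rows of Sxx are frequency bands, columns are time frames: each entry is
# the power in one band at one moment, exactly the picture described above.
f, frames, Sxx = spectrogram(x, fs=fs, nperseg=512, noverlap=384)
print(Sxx.shape)                  # (frequency bands, time frames)
```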

Vowel sounds are voiced, as are some consonants that are extended in time (m, n, z, zh, v, etc.). For such sounds, the vocal cords are vibrating, producing a fundamental frequency, and the sound consists of that fundamental frequency along with its harmonics (integer multiples of the fundamental frequency). This voicing is visible in spectrograms as a series of thin, horizontal bands corresponding to the separate harmonics. Vowels are voiced, unless you whisper. The manner of articulation (how you shape your tongue and lips) filters the sound, emphasizing the harmonics in certain frequency regions; these broad spectral peaks are called formants, and they are what make the spectrum different for each vowel sound.
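A small worked example may help here: a voiced sound with a 120 Hz fundamental has harmonics at 240, 360, 480 Hz, and so on, and the vocal-tract filtering boosts the harmonics that fall near the formant frequencies. The formant values and the toy Gaussian filter below are rough illustrative numbers, not measurements from the lecture.

```python
# A sketch of voicing (fundamental + harmonics) shaped by a toy vocal-tract
# filter whose peaks play the role of formants. All numbers are illustrative.
import numpy as np

f0 = 120.0                               # fundamental frequency (Hz)
harmonics = f0 * np.arange(1, 40)        # 120, 240, 360, ... Hz

formants = [700.0, 1200.0, 2600.0]       # rough "ah"-like formant frequencies

def vocal_tract_gain(freq, bandwidth=150.0):
    """Toy filter: a bump of gain centered on each formant frequency."""
    return sum(np.exp(-0.5 * ((freq - f) / bandwidth) ** 2) for f in formants)

amplitudes = vocal_tract_gain(harmonics)
strongest = sorted(harmonics[np.argsort(amplitudes)[-3:]])
print("harmonics nearest the formants come out strongest:", strongest)
```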

Stop consonants are visible as a complete cessation of airflow, and hence of any sound. That is, there is a brief silence, visible as a vertical, blank band in the spectrogram. Sibilants and fricatives involve a rush of air through a near-constriction (with different places of articulation for different consonants). These are visible in the spectrogram as a wide band of frequencies (or "noise"), with different bands for different consonants ("s" is higher in frequency than "sh", for example).

Example formants (corresponding to vowels) and formant transitions (corresponding to consonants) are shown in the following graphs. The spectrograms (first artificial, then real) below show a two-phoneme utterance (a stop consonant followed by a vowel).

The formants (after the initial transient shifts) are the same in all examples, indicating that all three are the same vowel sound ("ah"). However, they are preceded by a formant transition. It turns out that the form of the formant transition (upward or downward for each formant, and where in the spectrum these transitions arise) is one of the spectral features that listeners use to discriminate which stop consonant preceded the vowel. Thus, the difference between "ba", "da", and "ga" is in the formant transitions. Some formant transitions are very brief (10-50 msec), like "ba" and "da". Others are relatively long, like "pa" and "ga". The length of the formant transition and the time at which voicing begins following the stop are indications of whether the stop consonant is voiced (b, d, g) or unvoiced (p, t, k). These distinctions have been studied perceptually by generating artificial spectrograms (such as those in the top half of the figure above) and asking listeners to identify the utterance. These artificial spectrograms were originally "played" on a machine called a vocoder.
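The spirit of those artificial-spectrogram experiments can be sketched in code: pick formant tracks that glide into the vowel's formants over the first few tens of milliseconds, then synthesize a sound that follows them. The frequencies, durations, and the use of bare sinusoids below are simplifications of my own, not the lecture's (or the vocoder's) actual synthesis recipe.

```python
# A crude sketch of a stop-consonant + vowel syllable built from artificial
# formant tracks. Rising vs. falling F2 transitions give "ba"-like vs.
# "da"-like syllables. All parameter values are illustrative.
import numpy as np

fs = 16000
t = np.arange(int(0.30 * fs)) / fs        # 300 ms utterance
trans = t < 0.05                          # 50 ms formant transition

def formant_track(start, target):
    """Glide linearly from `start` to the vowel's formant, then hold."""
    f = np.full_like(t, target)
    f[trans] = start + (target - start) * (t[trans] / 0.05)
    return f

def synthesize(f2_start):
    """Two sinusoids whose instantaneous frequencies follow the tracks
    (a stand-in for real formant synthesis)."""
    f1 = formant_track(300.0, 700.0)      # F1 rises into the vowel "ah"
    f2 = formant_track(f2_start, 1200.0)  # F2 transition cues the stop consonant
    return (np.sin(2 * np.pi * np.cumsum(f1) / fs)
            + 0.5 * np.sin(2 * np.pi * np.cumsum(f2) / fs))

ba_like = synthesize(f2_start=800.0)      # F2 rises: more "ba"-like
da_like = synthesize(f2_start=1700.0)     # F2 falls: more "da"-like
```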

Some example spectrograms follow for real speech. They are complex, but you can see formants and formant transitions. You can also see that the frequencies are sometimes high and sometimes low. This corresponds to the prosody of the speech; prosody is the extra-verbal aspect of speech that indicates such things as word stress, or whether a sentence is a statement, an imperative, or a question.

Auditory cortex, Wernicke's Area and Broca's area. Primary auditory cortex is located laterally, near the top of the temporal lobe.

Auditory cortex is critical for speech perception and language comprehension. Aphasia refers to the collective deficits in language comprehension and production that accompany brain damage. Damage to primary auditory cortex and/or the adjacent Wernicke's area causes one kind of aphasia (Wernicke's aphasia), a disorder of language comprehension. Damage to Broca's area, located near motor cortex, causes a different kind of aphasia (Broca's aphasia), a disorder of speech production.

Language learning impairment. Paula Tallal (at Rutgers University) has spent her career studying language learning disabilities: kids who have difficulty learning to understand and produce language. Tallal has demonstrated that these kids have difficulty with speech because they have deficits in the fast (tens of milliseconds) temporal processing needed to distinguish brief formant transitions (like those in "ba" and "da").

In these experiments, subjects had to discriminate between two stimuli: either a high tone followed by a low tone, or a low tone followed by a high tone. If the tones are separated in time by more than half a second, then both normal and language learning impaired (LLI) subjects have no problem performing the task (100% correct). But for shorter separations (shorter inter-stimulus intervals), LLI children show a dramatic deficit in performance. Tallal believes that this is the cause of their language disability. Because they can't hear the differences between rapidly changing sounds, they can't discriminate one formant transition from another, so they have trouble understanding and producing speech.
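A sketch of how such two-tone stimuli could be generated is below. The tone frequencies and durations are placeholders of my own; they are not Tallal's actual experimental parameters.

```python
# A sketch of two-tone sequence stimuli with a variable inter-stimulus
# interval (ISI). All frequencies and durations are illustrative.
import numpy as np

fs = 44100
low, high = 100.0, 300.0          # tone frequencies (Hz), placeholders
tone_dur = 0.075                  # 75 ms tones, placeholder

def tone(freq):
    t = np.arange(int(tone_dur * fs)) / fs
    return np.sin(2 * np.pi * freq * t)

def two_tone_stimulus(order, isi):
    """order is ("high", "low") or ("low", "high"); isi is in seconds."""
    first, second = ((tone(high), tone(low)) if order == ("high", "low")
                     else (tone(low), tone(high)))
    gap = np.zeros(int(isi * fs))
    return np.concatenate([first, gap, second])

easy = two_tone_stimulus(("high", "low"), isi=0.5)   # long ISI: everyone succeeds
hard = two_tone_stimulus(("high", "low"), isi=0.05)  # short ISI: hard for LLI kids
```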

Tallal and Michael Merzenich (a neurobiologist at UCSF) developed a software system to help kids with language learning disabilities. The software is kind of like a computer game, but is really an auditory psychophysics experiment in disguise. The idea is to give the kids lots of practice making threshold discriminations between sounds. Over time, with practice, their performance improves, and this results in better language comprehension and production. You can find out more about this company and their "FastForword" software at the Scientific Learning Corporation website.
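The lecture does not describe the training algorithm itself, so the following is only a generic sketch of "practice at threshold": an adaptive rule that shortens the inter-stimulus interval after correct responses and lengthens it after errors, keeping the discrimination near the listener's limit.

```python
# A generic adaptive-staircase sketch (not Scientific Learning's actual
# algorithm): the ISI shrinks when the listener is correct and grows when
# the listener is wrong, so practice stays near threshold.
def update_isi(isi, correct, step=0.9):
    """One-up/one-down staircase on the inter-stimulus interval (seconds)."""
    return isi * step if correct else isi / step

isi = 0.5                          # start with an easy half-second ISI
for correct in [True, True, True, False, True, True, False, True]:
    isi = update_isi(isi, correct)
print(f"ISI after this run of trials: {isi * 1000:.0f} ms")
```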

Cochlear implants. The cochlear implant is a wonderful example of how we can take the results of basic research, our understanding of how the peripheral auditory system (cochlea, auditory nerve) responds to sound, and put it to use.

Several electrodes are mounted on a carefully designed support that is matched to the shape of the cochlea. The design of this support structure is critical because it places the electrodes very near the nerve cells. A computer decomposes a sound signal into its frequency components via the Fourier transform and sends the separate frequency components to the corresponding electrodes. In other words, it computes a spectrogram, mimicking the frequency decomposition performed by the cochlea. Then the implant transforms the spectrogram into a series of current pulses for each of the stimulating electrodes. This transformation into current pulses is based on what we know about the coding of information in the auditory nerve. Both the temporal and place codes are important for signalling pitch. Both the nerve firing rates and the number of active neurons are important for signalling loudness. The goal is to accurately replicate the neural code that would naturally be communicated along the auditory nerve. Note the need to maintain the timing of the information to within tens of milliseconds to preserve formant transitions.
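A much-simplified sketch of the processing chain just described is below: compute a spectrogram, pool the frequency bands into one channel per electrode (the place code), and map each channel's energy at each time frame to a current-pulse level (carrying loudness), while keeping frames a few milliseconds apart so the timing of formant transitions is preserved. The channel count, compression function, and stand-in input are my own illustrative choices; this is not any manufacturer's actual coding strategy.

```python
# A simplified cochlear-implant-style processing sketch: sound -> spectrogram
# -> one energy channel per electrode -> compressed current-pulse levels.
# All parameter choices are illustrative.
import numpy as np
from scipy.signal import spectrogram

fs = 16000
t = np.arange(0, 0.5, 1 / fs)
sound = np.sin(2 * np.pi * 440 * t)          # stand-in input signal

n_electrodes = 12                            # implants typically have ~12-22
f, frames, Sxx = spectrogram(sound, fs=fs, nperseg=128, noverlap=96)

# Pool spectrogram rows into one band per electrode (the place code).
bands = np.array_split(Sxx, n_electrodes, axis=0)
channel_energy = np.array([b.sum(axis=0) for b in bands])

# Compress into a normalized electrical range (loudness is carried by the
# current level and the number/rate of activated neurons, not raw amplitude).
current = np.log1p(channel_energy)
current /= current.max()

# One row per electrode, one column per time frame; with this hop size the
# frames are ~2 ms apart, preserving the timing of formant transitions.
print(current.shape)                         # (n_electrodes, number of frames)
```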

How well does it work? In some patients, cochlear implants restore speech perception nearly perfectly. But that is not the case for most patients (at least at this time). When it doesn't work so well, it can be a detriment for some kids who can't hear well enough to succeed in the hearing community. Consequently, there is some controversy, which we discussed in class.

SF Chronicle article (9/23/2001) about a deaf father, mother, and daughter who gained hearing together through cochlear implants.


Copyright © 2006, Department of Psychology, New York University
David Heeger