CHAPTER 3: ISSUES IN SPEECH FUNDAMENTAL FREQUENCY AND PERIOD ESTIMATION

3.1 INTRODUCTION

This chapter explores some of the issues and problems involved in the estimation of speech fundamental frequency. Firstly there is a discussion of what is meant by the terms fundamental frequency, fundamental period and pitch. Some aspects of human pitch perception and their relationships to the requirements of algorithms that estimate speech fundamental frequency are then discussed. Finally, there is a brief introduction to the basic approaches to speech fundamental frequency estimation by machine.

3.1.1 Fundamental frequency and pitch

Before entering into an in-depth discussion of the problems involved in estimating speech fundamental frequency, it is necessary to define the problem precisely. It is also enlightening to investigate the relationship between the parameters fundamental frequency, fundamental period and pitch.

The automatic estimation of the fundamental frequency of voiced speech excitation is often misleadingly referred to as pitch analysis. Pitch properly refers to a percept rather than a parameter of speech production (McKinney, 1965), although the term pitch is often used in the current technical literature to express both fundamental frequency and fundamental period. Pitch is a subjective phenomenon, whereas fundamental frequency is open to physical measurement. There is a relationship between pitch and frequency, but it is rather complex, although pitch is correlated with the physical feature of fundamental frequency. Thus, when one is considering speech at the acoustic level, it is preferable to use the concept of fundamental frequency. It is also useful to distinguish between fundamental period estimation, which implies a period-by-period estimation process, and fundamental frequency estimation, which results from short-term analysis.
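To make this last distinction concrete, the following minimal sketch (added here for illustration; it is not from the original text, and the function names and excitation marker times are hypothetical) contrasts a period-by-period estimate, computed from successive excitation markers, with a short-term estimate obtained by averaging over a fixed analysis window.

```python
# Illustrative sketch (not part of the original chapter): contrasting
# period-by-period and short-term estimates of fundamental frequency.

def period_by_period_f0(marker_times):
    """F0 estimate for every pair of successive excitation markers (Hz)."""
    periods = [t2 - t1 for t1, t2 in zip(marker_times, marker_times[1:])]
    return [1.0 / p for p in periods]

def short_term_f0(marker_times, window_start, window_length):
    """Average F0 over one analysis window (Hz): markers in window / elapsed time."""
    in_window = [t for t in marker_times
                 if window_start <= t < window_start + window_length]
    if len(in_window) < 2:
        return None                      # not enough excitation events to estimate
    return (len(in_window) - 1) / (in_window[-1] - in_window[0])

markers = [0.000, 0.0098, 0.0199, 0.0301, 0.0405]   # seconds (hypothetical)
print(period_by_period_f0(markers))                  # one value per glottal cycle
print(short_term_f0(markers, 0.0, 0.040))            # single value for a 40 ms window
```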


supra-glottal system transfer function represents the characteristics of the vocal tract and radiation at the lips. The glottal wave is often modelled as a pulse train. However, in this model ug(t) will be considered to be due to the sum of a pulse train pg(t) and a slowly varying function vg(t). The latter term is required because the volume velocity at the glottis does not always go to zero during each cycle of vibration. The function pg(t) will be called the excitation pulse function. Each individual excitation pulse has an associated time of occurrence, its excitation pulse time. In order to make this coincide with the principal excitation of the formant resonance in the vocal tract, the excitation time is defined to occur at the time when the excitation pulse function reaches a zero value at the end of each glottal cycle (see figure 3.1). This time is also the instant of glottal closure, and corresponds to the maximum positive gradient in a laryngograph waveform.

3.2 FUNDAMENTAL PERIOD, FUNDAMENTAL FREQUENCY AND PITCH

3.2.1 Definition of fundamental period

Hess (1983) states that there are three possible ways to define To, the speech fundamental period.

1] There is a long-term definition, whereby To is the period duration of a signal that is strictly periodic.

2] There is a short-term definition, in which case To is the average elapsed time between successive excitations, somehow averaged over a specified short-term window.

3] There is a period-by-period definition, where To is the elapsed time between two successive period markers.

Definition 1] cannot be applied to speech, because it is a quasi-periodic signal and this definition only applies to stationary signals. Definition 2] implies a short-term analysis of the speech signal, whereas 3] can be achieved by means of time-domain analysis of


perceived pitch. He states that "The extraction of fundamental frequency is in some respect equivalent to extraction of virtual pitch. In a strict sense, however, the frequency which corresponds to virtual pitch, and the fundamental frequency (defined as the largest common divisor of the partials) are in general not identical. ... Hence in the analysis of auditory signals such as speech and music actually the extraction of fundamental frequency is not the real aim but rather extraction of the frequency which corresponds to the virtual pitch".

3.2.6 Difference limens for changes in frequency

The smallest detectable change in the frequency of a stimulus is known as the frequency difference limen (DL) for frequency change. For synthetic speech stimuli the fundamental frequency DL has a value of about 0.3% to 0.5% of the fundamental frequency over the fundamental frequency range of the male voice, that is over about 40Hz-150Hz (Flanagan & Saslow, 1958). This is less than the difference limen for a pure tone within the same frequency range, which corresponds to about 3Hz (Zwicker & Feldkeller, 1967).

Even if changes in fundamental frequency are audible, they are not necessarily linguistically significant. The DL for linguistic significance is an order of magnitude larger than the DL for audibility (McKinney, 1965). This is not that surprising if one considers that, if a change is important, then it makes sense that it should be easy for the auditory system to detect.

3.2.7 The precision of speech production

Hess (1983) states that unless the output from a speech fundamental frequency estimation algorithm is to be used in synthesis applications (in which case the result is presented to the ear), or for scientific investigations into vocal fold vibration, there is no need to estimate speech fundamental frequency to a higher accuracy than it can be produced by the vocal apparatus. Various researchers have carried out measurements of the cycle-by-cycle changes in the location of the glottal pulses. Gill (1962) found that there are more variations in wave-shape than in length of the glottal excitation.


Lieberman (1963) found that, for successive periods, there was a relative difference of more than 1% for 30% of all periods, and a difference of more than 3% for 10% of the periods. Similar results were found by Hollien et al. (1973) and Horii (1979). Horii found that the mean value of the jitter (the absolute difference in time) between two successive glottal pulses had a value of 51 microseconds at 98Hz and 24 microseconds at 298Hz. In addition, for 10% of the periods in the data used, the jitter exceeded 100 microseconds.

These perturbations in the excitation are large compared to the frequency DLs for steady-state stimuli, and are audible to a listener. They cannot be individually distinguished, but contribute to the sensation of naturalness (Schroeder & David, 1960). Their effect is quite different from that of quantization noise, as has been observed in the context of speech synthesis (Holmes, 1976).
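As a rough illustration of how such jitter figures relate to the difference limens quoted in section 3.2.6, the sketch below (added for illustration only; the pulse times and the function name are hypothetical) computes the cycle-by-cycle jitter from a list of glottal pulse times and expresses it as a fraction of the local period, so that it can be compared against a nominal 0.5% DL.

```python
# Illustrative sketch (not from the original text): cycle-by-cycle jitter
# computed from hypothetical glottal pulse times.

def jitter_stats(pulse_times, dl_fraction=0.005):
    """Absolute jitter (s) and relative jitter for each pair of adjacent periods,
    flagging those that exceed a nominal difference limen (default 0.5%)."""
    periods = [t2 - t1 for t1, t2 in zip(pulse_times, pulse_times[1:])]
    results = []
    for p1, p2 in zip(periods, periods[1:]):
        absolute = abs(p2 - p1)                 # jitter in seconds
        relative = absolute / p1                # as a fraction of the earlier period
        results.append((absolute, relative, relative > dl_fraction))
    return results

pulses = [0.0000, 0.0102, 0.0203, 0.0306, 0.0407]    # seconds (hypothetical)
for absolute, relative, above_dl in jitter_stats(pulses):
    print(f"jitter = {absolute*1e6:5.0f} us, {relative*100:4.2f} %, above DL: {above_dl}")
```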


3.3 PROBLEMS IN SPEECH FUNDAMENTAL PERIOD AND FREQUENCY ESTIMATION

3.3.1 Basic difficulties

The determination of speech fundamental frequency is a difficult problem for many reasons. Speech is a non-stationary signal; that is to say, its characteristics change greatly as a function of time. One reason for this is that the shape of the vocal tract can change rapidly, even within the space of a single fundamental period. In addition, the vocal tract can give rise to a wide variety of speech sounds, with a multitude of different temporal structures. The glottal excitation of the vocal tract is often only quasi-periodic; this is particularly true in the case of creaky voice. In addition, there are acoustic interactions between the excitation from the vocal folds and the vocal tract.

3.3.2 Requirements for fundamental frequency estimation algorithms

There have been many suggestions as to how the ideal fundamental frequency algorithm should perform (Rabiner et al., 1976). It must be free from gross errors, which occur when the frequency or period estimates deviate substantially from their true values. It must be able to retain the irregularity that exists in the vocal fold vibration. The fundamental period or fundamental frequency values should be as accurate as possible. The algorithm must be able to respond rapidly enough to changes in the excitation period. There should be no voicing determination errors. The measurements should be robust over different speakers, noise and environmental conditions. The algorithm should ideally require as little computation as possible, because this makes it easier (and possibly cheaper) to implement in real time, and for non-real-time applications it will need less computer time to run (although this is becoming less important as time goes on, because of improvements in computer technology).

The requirements for a fundamental frequency or period estimation algorithm are all dictated by the characteristics of speech production, speech perception, and the particular application for which the algorithm is intended. The human ear is capable of detecting sounds over a wider frequency range than the vocal apparatus can produce, and can detect changes in frequency that are far smaller than the smallest frequency perturbations that a speaker can intentionally generate.

3.3.3 Sources of gross errors in fundamental period and frequency estimation

There are various reasons why a particular algorithm may generate gross errors. Firstly, there may be adverse signal conditions, which can occur when there is a strong first formant, a rapid change in articulator positions, or in the case of band-limited or noisy speech. Secondly, there may be inadequate algorithm performance, perhaps because the analysis window is too small in a short-term algorithm, or because of the absence of some feature used in the estimation process. Thirdly, the algorithm may be unable to deal satisfactorily with creaky voice; in this case, the inherent averaging in some algorithms may cause erroneous output to be generated.

In addition, difficulties can arise due to the recording conditions. Quite often the speech signal is degraded by amplitude and phase distortions, and background noise is almost always present to some extent. It is particularly difficult to get algorithms to operate


1983 gives the range of 50Hz to 1800Hz to cover a bass to a soprano.

For an individual speaker, the distribution of fundamental frequency depends upon the experimental conditions. It is particularly relevant whether the speech was taken from conversation or from read text. The frequency distributions from read text rarely exceed an octave range. Provided the distribution is plotted on a logarithmic scale, this fundamental frequency distribution comes close to a normal distribution (Risberg, 1961; Schultz-Colson, 1975).

Algorithms that perform speech fundamental frequency estimation usually restrict their operation to a sub-range of the possible fundamental frequency values. A good working range for an algorithm is between 50Hz and 800Hz, because this covers the range of most adult conversational speech (Hess, 1983).

3.3.5 Required measurement resolution and accuracy

The accuracy and resolution requirements for a fundamental frequency algorithm are determined by its intended applications. The human auditory system is more sensitive to changes in absolute frequency at low frequencies, and in general the just-noticeable difference in frequency is proportional to frequency. The difference limen with respect to the fundamental frequency (DL) for human listeners perhaps represents the ultimate required performance, which is typically a resolution of 0.3-0.5% of the fundamental frequency for steady-state harmonic sounds. Most algorithms do not meet this specification. However, for most applications, less accuracy can be tolerated.

The difference limen for linguistic significance is greater than that for perception (McKinney, 1965). Thus, for prosodic analysis, an accuracy of a few percent may be adequate.

The frequency (or time) resolution required depends upon the intended application of the algorithm. For intonation training, a resolution of 3-4% will suffice (for example in a Voiscope; Abberton & Fourcin, 1973). There are also limits on the resolution of fundamental frequency values that can be displayed with such schemes, due to the limited number of pixels available for the graphics display.
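To relate these percentage figures to absolute frequencies, the short sketch below (illustrative only, not from the original text; the function name and the chosen F0 values are arbitrary) converts a percentage resolution requirement into Hz at a few representative fundamental frequencies.

```python
# Illustrative sketch (not part of the original chapter): converting a
# percentage resolution requirement into an absolute frequency resolution.

def resolution_hz(f0_hz, percent):
    """Required frequency resolution in Hz for a given F0 and percentage figure."""
    return f0_hz * percent / 100.0

for f0 in (100.0, 200.0, 400.0):            # representative fundamental frequencies
    dl = resolution_hz(f0, 0.5)             # perceptual DL of roughly 0.5%
    training = resolution_hz(f0, 4.0)       # intonation-training figure of 3-4%
    print(f"F0 = {f0:5.1f} Hz: DL ~ {dl:.2f} Hz, intonation training ~ {training:.1f} Hz")
```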


Consideration of human frequency difference limens suggests that a frequency resolution of 0.3%-0.4% of the fundamental frequency value would ideally be required of a fundamental frequency or period estimation algorithm.

Requirements for profoundly deaf EPT patients

The required frequency resolution for the profoundly deaf patients for whom high-technology signal processing hearing aids are intended is only about 1% of the fundamental frequency values within the male frequency range, and poor above about 200Hz, which is several times worse than for normal listeners.

3.3.6 Accuracy limitations due to time quantization of sampled signals

There is an intrinsic accuracy limit in time-domain fundamental frequency estimation algorithms that operate on sampled digital signals, which is due to the time quantization of the input signal. This introduces uncertainty into the location of an event in time. For example, at a sampling frequency of 10kHz, it is only possible to locate a time event to within 1/10000 s = 100 microseconds. For a fundamental frequency of 100Hz, this corresponds to an accuracy of 1%. At higher fundamental frequencies, this percentage error increases still further. Even at 100Hz, this error is greater than the auditory DL for frequency change. The same problem arises for short-term analysis algorithms that operate in the lag domain (for example auto-correlation, cepstral analysis, etc.).

There is a similar problem in the case of frequency-domain analyzers. In this case, a sampling rate of 10kHz and an analysis window of 100ms (which is very long for the short-term analysis of speech) gives rise to a frequency resolution of 10Hz. Consequently, in this case it is the lower frequencies that give a proportionally larger quantization error. Thus there is a 10% error at 100Hz, and a 2% error at 500Hz.
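The arithmetic behind these figures can be made explicit. The sketch below (illustrative, not from the original text; the function names are hypothetical) computes the time-quantization error of a time-domain analyzer and the bin-quantization error of a frequency-domain analyzer, using the 10kHz sampling rate and 100ms window quoted above.

```python
# Illustrative sketch (not part of the original chapter): quantization errors
# implied by a 10 kHz sampling rate and a 100 ms analysis window.

FS = 10_000.0          # sampling frequency in Hz
WINDOW = 0.100         # analysis window length in seconds

def time_domain_error(f0_hz):
    """Worst-case relative error when a period is located to the nearest sample."""
    sample_period = 1.0 / FS                # 100 microseconds at 10 kHz
    fundamental_period = 1.0 / f0_hz
    return sample_period / fundamental_period

def frequency_domain_error(f0_hz):
    """Worst-case relative error when frequency is resolved to one analysis bin."""
    bin_width = 1.0 / WINDOW                # 10 Hz for a 100 ms window
    return bin_width / f0_hz

for f0 in (100.0, 500.0):
    print(f"F0 = {f0:5.1f} Hz: "
          f"time-domain error ~ {time_domain_error(f0):.1%}, "
          f"frequency-domain error ~ {frequency_domain_error(f0):.1%}")
```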


With regard to this accuracy issue, Hess and Indefrey (1987) point out that to reduce sampling inaccuracies to 0.5% for fundamental frequencies up to 500Hz requires a sampling period of 10 microseconds.

Many algorithms use interpolation at their outputs to improve the time or frequency resolution of their estimates. Interpolation can easily be carried out in the case of frequency-domain algorithms and those employing short-term analysis. Interpolation is more difficult to use in time-domain algorithms, although the accuracy of location of peaks and zero-crossings can be increased using interpolation. Another approach to reducing quantization errors is to smooth the frequency estimates, although this approach is not always guaranteed to improve accuracy.

3.3.7 Required maximum rate of change of speech fundamental period

In regularly excited speech (not creak), the maximum rate of change of period length is typically taken to be a 10% to 15% change between successive periods (Reddy, 1967). The maximum rate of change of frequency of the normal voice source was found to be about 1% per millisecond by Sundberg (1979). However, in voice qualities such as creak, as well as in pathological speech, there can be a much larger change per period than this figure suggests.

The maximum rate of change of fundamental period usually presents no problems to time-domain analyzers, because they operate on a period-by-period basis. However, these rates of change do impose an upper limit of around 20ms-30ms on the time window used in short-term analysis procedures.

3.4 CATEGORIZATION OF SPEECH FUNDAMENTAL FREQUENCY ESTIMATION ALGORITHMS

3.4.1 Preliminary classification


McKinney (1965) states that a 'pitch' determination algorithm can essentially be decomposed into three stages. These are the pre-processor, the basic extractor and the post-processor, as illustrated in figure 3.4. The main task of the measurement is performed by the basic extractor stage. The main function of the pre-processor is one of data reduction, and the emphasis of features in the input speech to facilitate the operation of the basic extractor. The post-processor combines many functions, such as error correction and the generation of output in the desired format.

3.4.2 Types of algorithm

The techniques that have been developed to determine speech fundamental frequency are broadly classified into four main groups by Hess (1983): those that operate in the time domain; those that operate over some short-term window of the speech, which he calls short-term analysis; those which are hybrids of the first two; and finally those that operate by direct measurement of vocal fold activity. There is often no clear-cut distinction between the first two types. It is important to understand what is meant by the terms short-term, time-domain and frequency-domain.

Time-domain algorithms employ direct measurements on the speech signal and involve looking for temporal features in the speech pressure waveform (or in the filtered waveform), such as local maxima and minima.

Short-term analysis procedures use some form of transformation of the data within a short (for example, 20ms) time window. The nature of the transformation depends on the particular method used. The estimate obtained with such an approach consists of a sequence of average fundamental period or frequency values obtained over the input interval.

Frequency-domain algorithms make explicit 'frequency' estimates. There may also be an implicit frequency-domain interpretation to certain short-term operations. For example, the auto-correlation technique can be implemented via a frequency-domain representation.
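As a concrete illustration of a short-term analysis procedure of the kind described above, the following sketch (not taken from the thesis; the function name, the 50-500Hz search range, the frame length and the voicing threshold are all arbitrary choices) estimates an average fundamental frequency for one analysis frame by locating the strongest auto-correlation peak within a plausible lag range.

```python
# Illustrative sketch (not from the original chapter): one frame of a
# short-term auto-correlation fundamental frequency estimate.
import math

def autocorrelation_f0(frame, fs, f_min=50.0, f_max=500.0):
    """Average F0 (Hz) for one analysis frame, or None if no clear peak is found.
    The search is restricted to lags corresponding to f_min..f_max."""
    lag_min = int(fs / f_max)               # shortest lag considered
    lag_max = int(fs / f_min)               # longest lag considered
    energy = sum(x * x for x in frame)
    if energy == 0.0 or lag_max >= len(frame):
        return None
    best_lag, best_value = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        value = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    # Require the peak to be a reasonable fraction of the frame energy (crude voicing check).
    if best_lag is None or best_value < 0.3 * energy:
        return None
    return fs / best_lag

# 40 ms frame of a synthetic 120 Hz "voiced" signal sampled at 10 kHz (hypothetical).
fs = 10_000
frame = [math.sin(2 * math.pi * 120 * n / fs) for n in range(400)]
print(autocorrelation_f0(frame, fs))        # roughly 120 Hz
```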


The time domain refers to analyses which use the same time base as the input speech signal. A time-domain analyzer gives rise to an output signal that consists of a series of excitation markers that delineate period boundaries. Time-domain operation thus generally presumes the local definition of fundamental period, and gives rise to period-by-period fundamental period estimates.

The next chapter will examine some time-domain, short-term and laryngeal algorithms in more detail.


Figure 3.1 Diagram showing voice source parameters.
This illustrates: a) the excitation signal, and b) the corresponding period durations.
(After McKinney, 1965).


Figure 3.2 Speech pressure waveform exhibiting two peaks per fundamental period.
The speech is shown in trace A. The corresponding laryngograph waveform is shown in trace B. This situation arises when the first formant coincides with the second harmonic in the excitation spectrum, and can lead to "doubling errors" in simple fundamental period estimation algorithms. The speech is a vowel from a male subject.


Figure 3.4 Block diagram illustrating the basic stages involved in speech fundamental frequency/period estimation.
The three stages shown are the pre-processor, the basic extractor and the post-processor. The pre-processing stage is involved with data reduction and extraction of important features of the speech signal. The basic extractor essentially performs the main task, the estimation of period or frequency. Finally, the post-processing stage converts the output from the basic extractor into a desirable format and may also perform error correction and smoothing of the raw estimates.
(Taken from Hess, 1983; after McKinney, 1965).
