CHAPTER 3: ISSUES IN SPEECH FUNDAMENTAL FREQUENCY AND PERIOD ESTIMATION

3.1 INTRODUCTION

This chapter explores some of the issues and problems involved in the estimation of speech fundamental frequency. Firstly there is a discussion of what is meant by the terms fundamental frequency, fundamental period and pitch. Some aspects of human pitch perception and their relationships to the requirements of algorithms that estimate speech fundamental frequency are then discussed. Finally, there is a brief introduction to the basic approaches to speech fundamental frequency estimation by machine.

3.1.1 Fundamental frequency and pitch

Before entering into an in-depth discussion of the problems involved in estimating speech fundamental frequency, it is necessary to define the problem precisely. It is also enlightening to investigate the relationship between the parameters fundamental frequency, fundamental period and pitch.

The automatic estimation of the fundamental frequency of voiced speech excitation is often misleadingly referred to as pitch analysis. Pitch properly refers to a percept rather than a parameter of speech production (McKinney, 1965), although the term pitch is often used in the current technical literature to express both fundamental frequency and fundamental period. Pitch is a subjective phenomenon, whereas fundamental frequency is open to physical measurement. There is a relationship between pitch and frequency, but it is rather complex, although pitch is correlated with the physical feature of fundamental frequency. Thus, when one is considering speech at the acoustic level, it is preferable to use the concept of fundamental frequency. It is also useful to distinguish between fundamental period estimation, which implies a period-by-period estimation process, and fundamental frequency estimation, which results from short-term analysis.
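To make this last distinction concrete, the following minimal sketch (added here for illustration; it is not from the original text, and the function names and excitation marker times are hypothetical) contrasts a period-by-period estimate, computed from successive excitation markers, with a short-term estimate obtained by averaging over a fixed analysis window.

```python
# Illustrative sketch (not part of the original chapter): contrasting
# period-by-period and short-term estimates of fundamental frequency.

def period_by_period_f0(marker_times):
    """F0 estimate for every pair of successive excitation markers (Hz)."""
    periods = [t2 - t1 for t1, t2 in zip(marker_times, marker_times[1:])]
    return [1.0 / p for p in periods]

def short_term_f0(marker_times, window_start, window_length):
    """Average F0 over one analysis window (Hz): markers in window / elapsed time."""
    in_window = [t for t in marker_times
                 if window_start <= t < window_start + window_length]
    if len(in_window) < 2:
        return None                      # not enough excitation events to estimate
    return (len(in_window) - 1) / (in_window[-1] - in_window[0])

markers = [0.000, 0.0098, 0.0199, 0.0301, 0.0405]   # seconds (hypothetical)
print(period_by_period_f0(markers))                  # one value per glottal cycle
print(short_term_f0(markers, 0.0, 0.040))            # single value for a 40 ms window
```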


supra-glottal system transfer function represents the characteristics of the vocal tract and radiation at the lips. The glottal wave is often modelled as a pulse train. However, in this model ug(t) will be considered to be due to the sum of a pulse train pg(t) and a slowly varying function vg(t). The latter term is required because the volume velocity at the glottis does not always go to zero during each cycle of vibration. The function pg(t) will be called the excitation pulse function. Each individual excitation pulse has an associated time of occurrence, its excitation pulse time. In order to make this coincide with the principal excitation of the formant resonance in the vocal tract, the excitation time is defined to occur at the time when the excitation pulse function reaches a zero value at the end of each glottal cycle (see figure 3.1). This time is also the instant of glottal closure, and corresponds to the maximum positive gradient in a laryngograph waveform.

3.2 FUNDAMENTAL PERIOD, FUNDAMENTAL FREQUENCY AND PITCH

3.2.1 Definition of fundamental period

Hess (1983) states that there are three possible ways to define To, the speech fundamental period.

1] There is a long-term definition, whereby To is the period duration of a signal that is strictly periodic.

2] There is a short-term definition, in which case To is the average elapsed time between successive excitations, somehow averaged over a specified short-term window.

3] There is a period-by-period definition, where To is the elapsed time between two successive period markers.

Definition 1] cannot be applied to speech, because it is a quasi-periodic signal and this definition only applies to stationary signals. Definition 2] implies a short-term analysis of the speech signal, whereas 3] can be achieved by means of time-domain analysis of


perceived pitch. He states that "The extraction of fundamental frequency is in some respect equivalent to extraction of virtual pitch. In a strict sense, however, the frequency which corresponds to virtual pitch, and the fundamental frequency (defined as the largest common divisor of the partials) are in general not identical. ... Hence in the analysis of auditory signals such as speech and music actually the extraction of fundamental frequency is not the real aim but rather extraction of the frequency which corresponds to the virtual pitch".

3.2.6 Difference limens for changes in frequency

The smallest detectable change in the frequency of a stimulus is known as the frequency difference limen (DL) for frequency change. For synthetic speech stimuli the fundamental frequency DL has a value of about 0.3% to 0.5% of the fundamental frequency over the fundamental frequency range of the male voice, that is over about 40Hz-150Hz (Flanagan & Saslow, 1958). This is less than the difference limen for a pure tone within the same frequency range, which corresponds to about 3Hz (Zwicker & Feldkeller, 1967).

Even if changes in fundamental frequency are audible, they are not necessarily linguistically significant. The DL for linguistic significance is an order of magnitude larger than the DL for audibility (McKinney, 1965). This is not that surprising if one considers that, if a change is important, then it makes sense that it should be easy for the auditory system to detect.

3.2.7 The precision of speech production

Hess (1983) states that unless the output from a speech fundamental frequency estimation algorithm is to be used in synthesis applications (in which case the result is presented to the ear), or for scientific investigations into vocal fold vibration, there is no need to estimate speech fundamental frequency to a higher accuracy than it can be produced by the vocal apparatus. Various researchers have carried out measurements of the cycle-by-cycle changes in the location of the glottal pulses. Gill (1962) found that there are more variations in wave-shape than in length of the glottal excitation.


Lieberman (1963) found that, for successive periods, there was a relative difference of more than 1% for 30% of all periods, and a difference of more than 3% for 10% of the periods. Similar results were found by Hollien et al. (1973) and Horii (1979). Horii found that the mean value of the jitter (the absolute difference in time) between two successive glottal pulses had a value of 51 microseconds at 98Hz and 24 microseconds at 298Hz. In addition, for 10% of the periods in the data used, the jitter exceeded 100 microseconds.

These perturbations in the excitation are large compared to the frequency DLs for steady-state stimuli, and are audible to a listener. They cannot be individually distinguished, but contribute to the sensation of naturalness (Schroeder & David, 1960). Their effect is quite different from that of quantization noise, as has been observed in the context of speech synthesis (Holmes, 1976).
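As a rough illustration of how such jitter figures relate to the difference limens quoted in section 3.2.6, the sketch below (added for illustration only; the pulse times and the function name are hypothetical) computes the cycle-by-cycle jitter from a list of glottal pulse times and expresses it as a fraction of the local period, so that it can be compared against a nominal 0.5% DL.

```python
# Illustrative sketch (not from the original text): cycle-by-cycle jitter
# computed from hypothetical glottal pulse times.

def jitter_stats(pulse_times, dl_fraction=0.005):
    """Absolute jitter (s) and relative jitter for each pair of adjacent periods,
    flagging those that exceed a nominal difference limen (default 0.5%)."""
    periods = [t2 - t1 for t1, t2 in zip(pulse_times, pulse_times[1:])]
    results = []
    for p1, p2 in zip(periods, periods[1:]):
        absolute = abs(p2 - p1)                 # jitter in seconds
        relative = absolute / p1                # as a fraction of the earlier period
        results.append((absolute, relative, relative > dl_fraction))
    return results

pulses = [0.0000, 0.0102, 0.0203, 0.0306, 0.0407]    # seconds (hypothetical)
for absolute, relative, above_dl in jitter_stats(pulses):
    print(f"jitter = {absolute*1e6:5.0f} us, {relative*100:4.2f} %, above DL: {above_dl}")
```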


3.3 PROBLEMS IN SPEECH FUNDAMENTAL PERIOD AND FREQUENCY ESTIMATION

3.3.1 Basic difficulties

The determination of speech fundamental frequency is a difficult problem for many reasons. Speech is a non-stationary signal; that is to say, its characteristics change greatly as a function of time. One reason for this is that the shape of the vocal tract can change rapidly, even within the space of a single fundamental period. In addition, the vocal tract can give rise to a wide variety of speech sounds, with a multitude of different temporal structures. The glottal excitation of the vocal tract is often only quasi-periodic; this is particularly true in the case of creaky voice. In addition, there are acoustic interactions between the excitation from the vocal folds and the vocal tract.

3.3.2 Requirements for fundamental frequency estimation algorithms

There have been many suggestions as to how the ideal fundamental frequency algorithm should perform (Rabiner et al., 1976). It must be free from gross errors, which occur when the frequency or period estimates deviate substantially from their true values. It must be able to retain the irregularity that exists in the vocal fold vibration. The fundamental period or fundamental frequency values should be as accurate as possible. The algorithm must be able to respond rapidly enough to changes in the excitation period. There should be no voicing determination errors. The measurements should be robust over different speakers, noise and environmental conditions. The algorithm should ideally require as little computation as possible, because this makes it easier (and possibly cheaper) to implement in real time, and for non-real-time applications it will need less computer time to run (although this is becoming less important as time goes on, because of improvements in computer technology).

The requirements for a fundamental frequency or period estimation algorithm are all dictated by the characteristics of speech production, speech perception, and the particular application for which the algorithm is intended. The human ear is capable of detecting sounds over a wider frequency range than the vocal apparatus can produce, and can detect changes in frequency that are far smaller than the smallest frequency perturbations that a speaker can intentionally generate.

3.3.3 Sources of gross errors in fundamental period and frequency estimation

There are various reasons why a particular algorithm may generate gross errors. Firstly, there may be adverse signal conditions, which can occur when there is a strong first formant, a rapid change in articulator positions, or in the case of band-limited or noisy speech. Secondly, there may be inadequate algorithm performance, perhaps because the analysis window is too small in a short-term algorithm, or because of the absence of some feature used in the estimation process. Thirdly, the algorithm may be unable to deal satisfactorily with creaky voice; in this case, the inherent averaging in some algorithms may cause erroneous output to be generated.

In addition, difficulties can arise due to the recording conditions. Quite often the speech signal is degraded by amplitude and phase distortions, and background noise is almost always present to some extent. It is particularly difficult to get algorithms to operate


1983 gives the range of 50Hz to 1800Hz to cover a bass to a soprano.

For an individual speaker, the distribution of fundamental frequency depends upon the experimental conditions. It is particularly relevant whether the speech was taken from conversation or from read text. The frequency distributions from read text rarely exceed an octave range. Provided the distribution is plotted on a logarithmic scale, this fundamental frequency distribution comes close to a normal distribution (Risberg, 1961; Schultz-Colson, 1975).

Algorithms that perform speech fundamental frequency estimation usually restrict their operation to a sub-range of the possible fundamental frequency values. A good working range for an algorithm is between 50Hz and 800Hz, because this covers the range of most adult conversational speech (Hess, 1983).

3.3.5 Required measurement resolution and accuracy

The accuracy and resolution requirements for a fundamental frequency algorithm are determined by its intended applications. The human auditory system is more sensitive to changes in absolute frequency at low frequencies, and in general the just-noticeable difference in frequency is proportional to frequency. The difference limen with respect to the fundamental frequency (DL) for human listeners perhaps represents the ultimate required performance, which is typically a resolution of 0.3-0.5% of the fundamental frequency for steady-state harmonic sounds. Most algorithms do not meet this specification. However, for most applications, less accuracy can be tolerated.

The difference limen for linguistic significance is greater than that for perception (McKinney, 1965). Thus, for prosodic analysis, an accuracy of a few percent may be adequate.

The frequency (or time) resolution required depends upon the intended application of the algorithm. For intonation training, a resolution of 3-4% will suffice (for example in a Voiscope; Abberton & Fourcin, 1973). There are also limits on the resolution of fundamental frequency values that can be displayed with such schemes, due to the limited number of pixels available for the graphics display.
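To relate these percentage figures to absolute frequencies, the short sketch below (illustrative only, not from the original text; the function name and the chosen F0 values are arbitrary) converts a percentage resolution requirement into Hz at a few representative fundamental frequencies.

```python
# Illustrative sketch (not part of the original chapter): converting a
# percentage resolution requirement into an absolute frequency resolution.

def resolution_hz(f0_hz, percent):
    """Required frequency resolution in Hz for a given F0 and percentage figure."""
    return f0_hz * percent / 100.0

for f0 in (100.0, 200.0, 400.0):            # representative fundamental frequencies
    dl = resolution_hz(f0, 0.5)             # perceptual DL of roughly 0.5%
    training = resolution_hz(f0, 4.0)       # intonation-training figure of 3-4%
    print(f"F0 = {f0:5.1f} Hz: DL ~ {dl:.2f} Hz, intonation training ~ {training:.1f} Hz")
```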


Consideration of human frequency difference limens suggests that a frequency resolution of 0.3%-0.4% of the fundamental frequency value would ideally be required of a fundamental frequency or period estimation algorithm.

Requirements for profoundly deaf EPT patients

The required frequency resolution for the profoundly deaf patients for whom high-technology signal processing hearing aids are intended is only about 1% of the fundamental frequency values within the male frequency range, and poor above about 200Hz, which is several times worse than for normal listeners.

3.3.6 Accuracy limitations due to time quantization of sampled signals

There is an intrinsic accuracy limit in time-domain fundamental frequency estimation algorithms that operate on sampled digital signals, which is due to the time quantization of the input signal. This introduces uncertainty into the location of an event in time. For example, at a sampling frequency of 10kHz, it is only possible to locate a time event to within 1/10000 s = 100 microseconds. For a fundamental frequency of 100Hz, this corresponds to an accuracy of 1%. At higher fundamental frequencies, this percentage error increases still further. Even at 100Hz, this error is greater than the auditory DL for frequency change. The same problem arises for short-term analysis algorithms that operate in the lag domain (for example auto-correlation, cepstral analysis, etc.).

There is a similar problem in the case of frequency-domain analyzers. In this case, a sampling rate of 10kHz and an analysis window of 100ms (which is very long for the short-term analysis of speech) gives rise to a frequency resolution of 10Hz. Consequently, in this case it is the lower frequencies that give a proportionally larger quantization error. Thus there is a 10% error at 100Hz, and a 2% error at 500Hz.
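The arithmetic behind these figures can be made explicit. The sketch below (illustrative, not from the original text; the function names are hypothetical) computes the time-quantization error of a time-domain analyzer and the bin-quantization error of a frequency-domain analyzer, using the 10kHz sampling rate and 100ms window quoted above.

```python
# Illustrative sketch (not part of the original chapter): quantization errors
# implied by a 10 kHz sampling rate and a 100 ms analysis window.

FS = 10_000.0          # sampling frequency in Hz
WINDOW = 0.100         # analysis window length in seconds

def time_domain_error(f0_hz):
    """Worst-case relative error when a period is located to the nearest sample."""
    sample_period = 1.0 / FS                # 100 microseconds at 10 kHz
    fundamental_period = 1.0 / f0_hz
    return sample_period / fundamental_period

def frequency_domain_error(f0_hz):
    """Worst-case relative error when frequency is resolved to one analysis bin."""
    bin_width = 1.0 / WINDOW                # 10 Hz for a 100 ms window
    return bin_width / f0_hz

for f0 in (100.0, 500.0):
    print(f"F0 = {f0:5.1f} Hz: "
          f"time-domain error ~ {time_domain_error(f0):.1%}, "
          f"frequency-domain error ~ {frequency_domain_error(f0):.1%}")
```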


With regard to this accuracy issue, Hess and Indefrey (1987) point out that to reduce sampling inaccuracies to 0.5% for fundamental frequencies up to 500Hz requires a sampling period of 10 microseconds.

Many algorithms use interpolation at their outputs to improve the time or frequency resolution of their estimates. Interpolation can easily be carried out in the case of frequency-domain algorithms and those employing short-term analysis. Interpolation is more difficult to use in time-domain algorithms, although the accuracy of location of peaks and zero-crossings can be increased using interpolation. Another approach to reducing quantization errors is to smooth the frequency estimates, although this approach is not always guaranteed to improve accuracy.

3.3.7 Required maximum rate of change of speech fundamental period

In regularly excited speech (not creak), the maximum rate of change of period length is typically taken to be a 10% to 15% change between successive periods (Reddy, 1967). The maximum rate of change of frequency of the normal voice source was found to be about 1% per millisecond by Sundberg (1979). However, in voice qualities such as creak, as well as in pathological speech, there can be a much larger change per period than this figure suggests.

The maximum rate of change of fundamental period usually presents no problems to time-domain analyzers, because they operate on a period-by-period basis. However, these rates of change do impose an upper limit of around 20ms-30ms on the time window used in short-term analysis procedures.

3.4 CATEGORIZATION OF SPEECH FUNDAMENTAL FREQUENCY ESTIMATION ALGORITHMS

3.4.1 Preliminary classification


McKinney (1965) states that a 'pitch' determination algorithm can essentially be decomposed into three stages. These are the pre-processor, the basic extractor and the post-processor, as illustrated in figure 3.4. The main task of the measurement is performed by the basic extractor stage. The main function of the pre-processor is one of data reduction, and the emphasis of features in the input speech to facilitate the operation of the basic extractor. The post-processor combines many functions, such as error correction and the generation of output in the desired format.

3.4.2 Types of algorithm

The techniques that have been developed to determine speech fundamental frequency are broadly classified into four main groups by Hess (1983): those that operate in the time domain; those that operate over some short-term window of the speech, which he calls short-term analysis; those which are hybrids of the first two; and finally those that operate by direct measurement of vocal fold activity. There is often no clear-cut distinction between the first two types. It is important to understand what is meant by the terms short-term, time-domain and frequency-domain.

Time-domain algorithms employ direct measurements on the speech signal and involve looking for temporal features in the speech pressure waveform (or in the filtered waveform), such as local maxima and minima.

Short-term analysis procedures use some form of transformation of the data within a short (for example, 20ms) time window. The nature of the transformation depends on the particular method used. The estimate obtained with such an approach consists of a sequence of average fundamental period or frequency values obtained over the input interval.

Frequency-domain algorithms make explicit 'frequency' estimates. There may also be an implicit frequency-domain interpretation to certain short-term operations. For example, the auto-correlation technique can be implemented via a frequency-domain representation.
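As a concrete illustration of a short-term analysis procedure of the kind described above, the following sketch (not taken from the thesis; the function name, the 50-500Hz search range, the frame length and the voicing threshold are all arbitrary choices) estimates an average fundamental frequency for one analysis frame by locating the strongest auto-correlation peak within a plausible lag range.

```python
# Illustrative sketch (not from the original chapter): one frame of a
# short-term auto-correlation fundamental frequency estimate.
import math

def autocorrelation_f0(frame, fs, f_min=50.0, f_max=500.0):
    """Average F0 (Hz) for one analysis frame, or None if no clear peak is found.
    The search is restricted to lags corresponding to f_min..f_max."""
    lag_min = int(fs / f_max)               # shortest lag considered
    lag_max = int(fs / f_min)               # longest lag considered
    energy = sum(x * x for x in frame)
    if energy == 0.0 or lag_max >= len(frame):
        return None
    best_lag, best_value = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        value = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    # Require the peak to be a reasonable fraction of the frame energy (crude voicing check).
    if best_lag is None or best_value < 0.3 * energy:
        return None
    return fs / best_lag

# 40 ms frame of a synthetic 120 Hz "voiced" signal sampled at 10 kHz (hypothetical).
fs = 10_000
frame = [math.sin(2 * math.pi * 120 * n / fs) for n in range(400)]
print(autocorrelation_f0(frame, fs))        # roughly 120 Hz
```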


The time domain refers to analyses which use the same time base as the input speech signal. A time-domain analyzer gives rise to an output signal that consists of a series of excitation markers that delineate period boundaries. Time-domain operation thus generally presumes the local definition of fundamental period, and gives rise to period-by-period fundamental period estimates.

The next chapter will examine some time-domain, short-term and laryngeal algorithms in more detail.


Figure 3.1 Diagram showing voice source parameters.
This illustrates: a) the excitation signal, and b) the corresponding period durations.
(After McKinney, 1965).


Figure 3.2 Speech pressure waveform exhibiting two peaks per fundamental period.
The speech is shown in trace A. The corresponding laryngograph waveform is shown in trace B. This situation arises when the first formant coincides with the second harmonic in the excitation spectrum, and can lead to "doubling errors" in simple fundamental period estimation algorithms. The speech is a vowel from a male subject.


Figure 3.4 Block diagram illustrating the basic stages involved in speech fundamental frequency/period estimation.
The three stages shown are the pre-processor, the basic extractor and the post-processor. The pre-processing stage is involved with data reduction and extraction of important features of the speech signal. The basic extractor essentially performs the main task, the estimation of period or frequency. Finally, the post-processing stage converts the output from the basic extractor into a desirable format and may also perform error correction and smoothing of the raw estimates.
(Taken from Hess, 1983; after McKinney, 1965).
