Cepstral Analysis for pitch detection

P i · Jan 3, 2011 · Viewed 17k times

I'm looking to extract pitches from a sound signal.

Someone on IRC just explained to me how taking a double FFT achieves this. Specifically:

  1. take FFT
  2. take log of square of absolute value (can be done with lookup table)
  3. take another FFT
  4. take absolute value
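For reference, the four steps above can be sketched in plain Python. This is only an illustration under stated assumptions: it uses a slow textbook DFT rather than vDSP, and the names `dft`, `cepstrum` and `cepstral_period`, the `eps` guard against log(0), and the pulse-train test signal are all made up for the example.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) discrete Fourier transform (a stand-in for a real FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def cepstrum(signal, eps=1e-12):
    """Steps 1 to 4; eps is an assumed guard so that log(0) never occurs."""
    spectrum = dft(signal)                                        # step 1
    log_power = [math.log(abs(c) ** 2 + eps) for c in spectrum]   # step 2
    return [abs(c) for c in dft(log_power)]                       # steps 3 and 4

def cepstral_period(signal, min_lag, max_lag):
    """Lag, in samples, of the largest cepstral value in [min_lag, max_lag]."""
    ceps = cepstrum(signal)
    return max(range(min_lag, max_lag + 1), key=lambda q: ceps[q])

# Hypothetical test signal: a pulse train repeating every 8 samples,
# which is very rich in overtones.
pulse_train = [1.0 if t % 8 == 0 else 0.0 for t in range(64)]
period = cepstral_period(pulse_train, min_lag=2, max_lag=12)
```

The search range matters: the cepstrum also has large values at lag 0 and at multiples of the period, so the sketch only looks in a window of plausible lags. The pitch in Hz would then be the sample rate divided by `period`.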

I am attempting this using vDSP.

I can't understand how I didn't come across this technique earlier. I did a lot of hunting and asking questions; several weeks' worth. More to the point, I can't understand why I didn't think of it.

I am attempting to achieve this with the vDSP library. It looks as though it has functions to handle all of these tasks.

However, I'm wondering about the accuracy of the final result.

I have previously used a technique which scours the frequency bins of a single FFT for local maxima. When it encounters one, it uses a cunning technique (the change in phase since the last FFT) to more accurately place the actual peak within the bin.

I am worried that this precision will be lost with this technique I'm presenting here.

I guess the technique could be used after the second FFT to get the fundamental accurately. But it kind of looks like the information is lost in step 2.

As this is a potentially tricky process, could someone with some experience just look over what I'm doing and check it for sanity?

Also, I've heard there is an alternative technique involving fitting a quadratic over neighbouring bins. Is this of comparable accuracy? If so, I would favour it, as it doesn't involve remembering bin phases.

So, questions:

  • Does this approach make sense? Can it be improved?
  • I'm a bit worried about the "log square" component; there seems to be a vDSP function to do exactly that: vDSP_vdbcon. However, there is no indication it precalculates a log-table -- I assume it doesn't, as the FFT function requires an explicit pre-calculation function to be called and passed into it. And this function doesn't.
  • Is there some danger of harmonics being picked up?
  • Is there any cunning way of making vDSP pull out the maxima, biggest first?
  • Can anyone point me towards some research or literature on this technique?

  • the main question: Is it accurate enough? Can the accuracy be improved? I have just been told by an expert that the accuracy IS INDEED not sufficient. Is this the end of the line?

Pi

PS I get SO annoyed when I want to create tags, but cannot. :| I have suggested to the maintainers that SO keep track of attempted tags, but I'm sure I was ignored. We need tags for vDSP, accelerate framework, cepstral analysis

Answer

Jeremy Salwen · Aug 27, 2011

Okay, let's go through one by one:

I'm looking to extract pitches from a sound signal.

Although I am not an expert and have had minimal formal training, I think I know the best answer to this problem. I've done a lot of searching, reading, and experimenting over the past few years. My conclusion is that the autocorrelation method is by far the best pitch detector in terms of the tradeoff between accuracy, complexity, noise robustness, and speed. Unless you have some very specific circumstances, I would almost always recommend using autocorrelation. More on this later; first let me answer your other questions.

What you describe is "cepstral analysis", which is a method mainly used for the extraction of pitch from speech. Cepstral analysis relies entirely on the number and strength of the overtones of your signal. If, for example, you were to pass a pure sine wave through cepstral analysis, you would get terrible results. However, speech is a complex signal with a large number of overtones. (Overtones, by the way, are elements of the signal which oscillate at multiples of the fundamental frequency, i.e. the pitch we perceive.) Cepstral analysis can be robust in detecting the pitch of speech with a missing fundamental frequency. That is, suppose you plotted the function sin(4x)+sin(6x)+sin(8x)+sin(10x). If you look at that, it is clear that it has the same frequency as the function sin(2x). However, if you apply Fourier analysis to this function, the bin corresponding to sin(2x) will have zero magnitude. Thus this signal is considered to have a "missing fundamental frequency", because it does not contain a sinusoid at the frequency which we consider it to have. Simply picking the biggest peak of the Fourier transform will therefore not work on this signal.
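The missing-fundamental example can be checked numerically. The following is a small sketch, assuming 64 samples across one 2π span of x so that DFT bin k lines up with the sin(kx) component; the helper name `dft_magnitude` is made up for the illustration.

```python
import cmath
import math

N = 64  # assumed number of samples across one 2*pi span of x

def f(x):
    return math.sin(4 * x) + math.sin(6 * x) + math.sin(8 * x) + math.sin(10 * x)

samples = [f(2 * math.pi * t / N) for t in range(N)]

def dft_magnitude(x, k):
    """Magnitude of DFT bin k, computed directly from the definition."""
    n = len(x)
    return abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))

# With this sampling, bin k holds the sin(k*x) component.
magnitudes = {k: dft_magnitude(samples, k) for k in (2, 4, 6, 8, 10)}

# The waveform still repeats with the period of sin(2x), i.e. every N/2 samples.
repeat_error = max(abs(samples[t] - samples[(t + N // 2) % N]) for t in range(N))
```

Here `magnitudes[2]` comes out as zero up to rounding error while bins 4, 6, 8 and 10 each hold N/2, yet `repeat_error` shows the signal repeating at the period of sin(2x).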

I have previously used a technique which scours the frequency bins of a single FFT for local maxima. when it encounters one, it uses a cunning technique (the change in phase since the last FFT) to more accurately place the actual peak within the bin.

What you are describing is the phase vocoder technique to more accurately measure the frequency of a given partial. However, the basic technique of picking out the biggest bin is going to cause you problems if you use a signal with a missing or weak fundamental frequency component.
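As a rough sketch of that phase-based refinement: the phase of a bin advances between two frames by an amount that depends on the true frequency of the partial, so the difference between the measured and the expected advance locates the partial within the bin. The function name, its parameters, and the numbers in the example (hop of 1024 samples, FFT size 4096, 44100 Hz) are assumptions chosen for illustration.

```python
import math

def refine_frequency(phase_prev, phase_cur, bin_index, hop, fft_size, sample_rate):
    """Estimate a partial's frequency in Hz from the phase of one bin in two
    frames that start `hop` samples apart (hypothetical helper)."""
    # Phase advance expected if the partial sat exactly on the bin centre.
    expected = 2 * math.pi * bin_index * hop / fft_size
    deviation = (phase_cur - phase_prev) - expected
    # Wrap the deviation into the range [-pi, pi].
    deviation -= 2 * math.pi * round(deviation / (2 * math.pi))
    cycles_per_sample = bin_index / fft_size + deviation / (2 * math.pi * hop)
    return cycles_per_sample * sample_rate

# Simulated phases for an ideal 440 Hz partial observed in bin 41.
true_hz = 440.0
hop, fft_size, sample_rate = 1024, 4096, 44100
phase_prev = 0.3
phase_cur = phase_prev + 2 * math.pi * true_hz * hop / sample_rate
estimate = refine_frequency(phase_prev, phase_cur, 41, hop, fft_size, sample_rate)
```

The estimate is only unambiguous while the partial lies within half a cycle per hop of the bin centre, which is one reason overlapping frames with a short hop are normally used.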

I am worried that this precision will be lost with this technique I'm presenting here.

First of all, remember that the phase vocoder technique only more accurately measures the frequency of a single partial. It ignores the information contained in the higher partials about the fundamental frequency. Second of all, given a decent FFT size, you can get very good accuracy using peak interpolation. Someone else here has pointed you towards parabolic interpolation. I also would suggest this.

If you parabolically interpolate the FFT of a 4096-sample block of data at 44100 Hz, a pitch of about 440 Hz will fall between the 40th bin (430.66 Hz) and the 41st bin (441.43 Hz). Assuming this paper is approximately correct in the general case, it says parabolic interpolation increases resolution by more than one order of magnitude. With bins about 10.8 Hz apart, that leaves a resolution of about 1 Hz or better, which is roughly the threshold of human hearing. In fact, if you use an ideal Gaussian window, parabolic interpolation of the log-magnitude spectrum is exact at the peaks. (That's right, exact. Remember, however, that you can never use a true Gaussian window, because it extends forever in both directions.) If you are still worried about getting higher accuracy, you can always zero-pad the FFT. This means adding zeros to the end of the block of samples before transforming. It works out that this is equivalent to "sinc interpolation" of the spectrum, which is the ideal interpolation function for time-limited signals.
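To make the arithmetic concrete, here is a small sketch. It assumes a block size of 4096 samples, which is the size that reproduces the quoted bin frequencies, and the helper names are made up. `parabolic_offset` is the usual three-point formula: fit a parabola through a peak bin and its two neighbours and return how far the vertex lies from the centre bin.

```python
def bin_frequency(k, sample_rate, fft_size):
    """Centre frequency of FFT bin k in Hz."""
    return k * sample_rate / fft_size

def parabolic_offset(left, centre, right):
    """Offset, in bins, of the vertex of the parabola through three values
    taken at positions -1, 0 and +1 around a peak."""
    return 0.5 * (left - right) / (left - 2 * centre + right)

low = bin_frequency(40, 44100, 4096)
high = bin_frequency(41, 44100, 4096)

# Sanity check on a real parabola whose vertex sits at +0.3 bins.
ys = [-(x - 0.3) ** 2 for x in (-1, 0, 1)]
offset = parabolic_offset(*ys)
```

In use, the refined frequency would be `bin_frequency(k + offset, ...)` where k is the index of the peak bin. Applying the formula to log magnitudes rather than raw magnitudes is what makes it exact for a Gaussian window.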

I guess the technique could be used after the second FFT to get the fundamental accurately. But it kind of looks like the information is lost in step 2.

That is correct. The phase vocoder technique relies on the fact that sequential frames are connected and have a specific phase relationship. However, the log magnitudes of the FFTs of sequential frames do not have the same phase relationship, so it would be useless to apply the technique to the second FFT.

  • Does this approach make sense? Can it be improved?

Yes and yes. I will elaborate on the improvement in my bit on autocorrelation at the end.

  • I'm a bit worried about the "log square" component; there seems to be a vDSP function to do exactly that: vDSP_vdbcon. However, there is no indication it precalculates a log-table -- I assume it doesn't, as the FFT function requires an explicit pre-calculation function to be called and passed into it. And this function doesn't.

I don't know the specifics of the vDSP library, sorry.

  • Is there some danger of harmonics being picked up?

In your original phase-vocoder peak picking technique? Yes. With the cepstral method? No, not really; the whole point is that it considers all the harmonics to get its frequency estimate. For example, let's say our frequency is 1. Our overtones are 2, 3, 4, 5, 6, 7, 8, 9, etc. We would have to take out all of the odd harmonics (i.e. leave 2, 4, 6, 8, etc.) and remove the fundamental frequency before it would start to be confused with one of its overtones.

  • is there any cunning way of making vDSP pull out the maxima, biggest first?

Don't know vDSP, but in the general case, you usually just iterate over all of them and keep track of the biggest.
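In plain Python, and without assuming anything about vDSP, the general idea might look like the sketch below. The function name and the sample values are made up for illustration.

```python
def peaks_biggest_first(values):
    """Indices of the local maxima of `values`, tallest first."""
    peaks = [i for i in range(1, len(values) - 1)
             if values[i - 1] < values[i] >= values[i + 1]]
    return sorted(peaks, key=lambda i: values[i], reverse=True)

# Hypothetical magnitudes with local maxima at indices 1, 3 and 5.
magnitudes = [0.0, 1.0, 0.2, 3.0, 0.5, 2.0, 0.1]
order = peaks_biggest_first(magnitudes)
```

If only the single biggest value is needed, one pass keeping a running maximum is enough and the sort can be skipped.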

  • Can anyone point me towards some research or literature on this technique?

The link P. i gave you in a comment seemed like a good one.

Also, this website offers an incredibly in-depth and wonderfully broad explanation of DSP topics, including all sorts of pitch extraction, manipulation, etc, in both a theoretical and practical way. (this is a more general link to an index on the site). I always find myself coming back to it. Sometimes it can be a bit overwhelming if you jump into the middle of it, but you can always follow every explanation back to the basic building blocks.

Now for autocorrelation. Basically the technique is this: You take your (windowed) signal and time delay it different amounts. Find the amount which matches up best with your original signal. That is the fundamental period. It makes a lot of theoretical sense. You are hunting for the repetitive parts of your signal.

In practice, taking the correlation with all these time delayed copies of the signal is slow. It is usually implemented in this way instead (which is mathematically equivalent):

  1. Zero-pad the windowed signal to double its original length.
  2. Take the FFT.
  3. Replace every coefficient with its squared magnitude, except the first (the DC term), which you set to 0.
  4. Take the IFFT.
  5. Divide every element by the first one (the zero-lag value).

This gives you the autocorrelation. Mathematically, you are using the circular convolution theorem (look it up), and using zero-padding to convert a linear convolution problem into a circular convolution one, which can be efficiently solved.
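The recipe above can be sketched in plain Python as follows. This is an illustration only: it uses a simple recursive radix-2 FFT so that it needs no libraries, which means the padded length must be a power of two, and the function names and the test sine are assumptions.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(x):
    """Inverse FFT via the conjugate trick."""
    n = len(x)
    return [c.conjugate() / n for c in fft([v.conjugate() for v in x])]

def autocorrelation(signal):
    """Normalised autocorrelation for lags 0 .. len(signal) - 1."""
    n = len(signal)
    padded = list(signal) + [0.0] * n            # zero-pad to double length
    power = [abs(c) ** 2 for c in fft(padded)]   # squared magnitudes
    power[0] = 0.0                               # drop the DC term
    acf = [c.real for c in ifft(power)]
    return [v / acf[0] for v in acf[:n]]         # lag 0 becomes 1.0

# Hypothetical test signal: 64 samples of a sine with a 16-sample period.
sine = [math.sin(2 * math.pi * t / 16) for t in range(64)]
acf = autocorrelation(sine)
```

For this sine the result has its first strong peak at lag 16, the period in samples. Only the first half of the inverse transform is kept, because the second half holds the negative lags.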

However, be careful about picking the peak. For very small delays, the signal will match up with itself very well, simply because it is continuous (with zero delay it correlates perfectly with itself). Instead, pick the largest peak after the first zero-crossing. You can parabolically interpolate the autocorrelation function as well, just as with other techniques, to get much more accurate values.
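A minimal sketch of that peak-picking rule, assuming the autocorrelation is already available as a list indexed by lag. The name `pick_period` and the synthetic autocorrelation used to exercise it are made up for the example.

```python
import math

def pick_period(acf):
    """Lag of the largest value after the first zero crossing, or None if
    the autocorrelation never drops to zero or below."""
    lag = 0
    while lag < len(acf) and acf[lag] > 0:   # skip the lobe around lag 0
        lag += 1
    if lag >= len(acf) - 1:
        return None
    return max(range(lag, len(acf)), key=lambda i: acf[i])

# Synthetic autocorrelation resembling that of a sine with a 16-sample
# period: a cosine whose height tapers off as the lag grows.
acf = [math.cos(2 * math.pi * lag / 16) * (1 - lag / 64) for lag in range(64)]
period = pick_period(acf)
```

The returned lag is a whole number of samples; a three-point parabolic fit around it would give a fractional period.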

This by itself will give you very good pitch detection by all criteria. However, you might sometimes encounter a problem with pitch halving and pitch doubling. Basically the problem is that if a signal repeats every second, it also repeats every two seconds, so you might detect half the true pitch. Similarly, if it has a very strong overtone, you might get pitch doubling. So the biggest peak might not always be the one you want. A solution to this problem is the MPM algorithm by Philip McLeod. The idea is this:

Instead of picking the biggest peak, you want to pick the first peak that is large enough to be considered. How do you determine if a peak is large enough to be considered? If it is at least as high as A times the largest peak, where A is some constant. Philip suggests a value of A around 0.9, I think. Actually the program he wrote, Tartini, allows you to compare several different pitch detection algorithms in real time. I would strongly suggest downloading it and trying it out (it implements cepstrum, straight autocorrelation, and MPM). If you have trouble building it, try the instructions here.
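The thresholding idea on its own might be sketched like this. It is not a full MPM implementation, only the "first peak that is big enough" rule applied to a list of autocorrelation values; the function name, the default A of 0.9 and the sample data are assumptions.

```python
def first_big_peak(acf, a=0.9):
    """Lag of the first positive local maximum whose height is at least
    `a` times the height of the tallest one (lag 0 is ignored)."""
    peaks = [i for i in range(1, len(acf) - 1)
             if acf[i - 1] < acf[i] >= acf[i + 1] and acf[i] > 0]
    if not peaks:
        return None
    threshold = a * max(acf[i] for i in peaks)
    return next(i for i in peaks if acf[i] >= threshold)

# Hypothetical values: the peak at lag 8 is slightly taller than the one at
# lag 4, so picking the tallest peak would halve the detected pitch.
acf = [1.0, 0.5, -0.2, 0.4, 0.92, 0.3, -0.1, 0.5, 0.95, 0.4]
chosen = first_big_peak(acf)
tallest = first_big_peak(acf, a=1.0)
```

With A at 0.9 the earlier peak at lag 4 is chosen, while A at 1.0 degenerates into picking the tallest peak at lag 8.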

One last thing I should note is about windowing. In general, any smooth window will do. Hanning window, Hamming window, etc. Hopefully you should know how to window. I would also suggest doing overlapped windows if you want more accurate temporal measurements.
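For completeness, a small sketch of windowing with overlap. The Hann formula is the standard one; the helper names, the frame size and the hop are made up for the example.

```python
import math

def hann(n):
    """Symmetric Hann window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def overlapping_frames(signal, size, hop):
    """Frames of `size` samples whose starts are `hop` samples apart."""
    return [signal[start:start + size]
            for start in range(0, len(signal) - size + 1, hop)]

def windowed_frames(signal, size, hop):
    """Overlapping frames, each multiplied sample by sample by the window."""
    window = hann(size)
    return [[s * w for s, w in zip(frame, window)]
            for frame in overlapping_frames(signal, size, hop)]

# A hop of half the frame size gives 50% overlap.
frames = overlapping_frames(list(range(10)), size=4, hop=2)
window = hann(5)
```

Each windowed frame is then what you would feed to the FFT or autocorrelation step, giving one pitch estimate per hop.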

By the way, a cool property of the autocorrelation is that if the frequency is changing linearly through the windowed section you are measuring, it will give you the correct frequency at the center of the window.

One more thing: what I described is called the biased autocorrelation function. This is because for higher time lags, the overlap between the original signal and the time-lagged version becomes less and less. For example, if you look at a window of size N which has been delayed N-1 samples, you see that only one sample overlaps. So the correlation at this delay is clearly going to be very close to zero. You can compensate for this by dividing each value of the autocorrelation function by the number of samples that overlap at that lag. This is called the unbiased autocorrelation. However, in general, you will get worse results with this, as the higher delay values of the autocorrelation are very noisy, being based on only a few samples, so it makes sense to weight them less.
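The difference is easy to see on a constant signal. The sketch below computes both versions directly from the definition, which is slow but clear; the function names are made up for the illustration.

```python
def biased_autocorrelation(x):
    """Raw sum of products at each lag; fewer terms overlap as lag grows."""
    n = len(x)
    return [sum(x[t] * x[t + lag] for t in range(n - lag)) for lag in range(n)]

def unbiased_autocorrelation(x):
    """Biased value divided by the number of overlapping samples, n - lag."""
    n = len(x)
    return [r / (n - lag) for lag, r in enumerate(biased_autocorrelation(x))]

constant = [1.0] * 8
biased = biased_autocorrelation(constant)
unbiased = unbiased_autocorrelation(constant)
```

The biased version falls off linearly from 8 at lag 0 to 1 at lag 7, while the unbiased version stays at 1 for every lag. On real data the last few unbiased values rest on very few samples, which is the noise problem mentioned above.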

If you're looking for more information, as always, Google is your friend. Good search terms: autocorrelation, pitch detection, pitch tracking, pitch extraction, pitch estimation, cepstrum, etc.