Converting an FFT to a spectogram

Goz picture Goz · Nov 5, 2009 · Viewed 10.4k times · Source

I have an audio file and I am iterating through the file and taking 512 samples at each step and then passing them through an FFT.

I have the data out as a block 514 floats long (Using IPP's ippsFFTFwd_RToCCS_32f_I) with real and imaginary components interleaved.

My problem is what do I do with these complex numbers once i have them? At the moment I'm doing for each value

const float realValue   = buffer[(y * 2) + 0];
const float imagValue   = buffer[(y * 2) + 1];
const float value       = sqrt( (realValue * realValue) + (imagValue * imagValue) );

This gives something slightly usable but I'd rather some way of getting the values out in the range 0 to 1. The problem with he above is that the peaks end up coming back as around 9 or more. This means things get viciously saturated and then there are other parts of the spectrogram that barely shows up despite the fact that they appear to be quite strong when I run the audio through audition's spectrogram. I fully admit I'm not 100% sure what the data returned by the FFT is (Other than that it represents the frequency values of the 512 sample long block I'm passing in). Especially my understanding is lacking on what exactly the compex number represents.

Any advice and help would be much appreciated!

Edit: Just to clarify. My big problem is that the FFT values returned are meaningless without some idea of what the scale is. Can someone point me towards working out that scale?

Edit2: I get really nice looking results by doing the following:

size_t count2   = 0;
size_t max2     = kFFTSize + 2;
while( count2 < max2 )
{
    const float realValue   = buffer[(count2) + 0];
    const float imagValue   = buffer[(count2) + 1];
    const float value   = (log10f( sqrtf( (realValue * realValue) + (imagValue * imagValue) ) * rcpVerticalZoom ) + 1.0f) * 0.5f;
    buffer[count2 >> 1] = value;
    count2 += 2;
}

To my eye this even looks better than most other spectrogram implementations I have looked at.

Is there anything MAJORLY wrong with what I'm doing?

Answer

McBeth picture McBeth · Nov 5, 2009

The usual thing to do to get all of an FFT visible is to take the logarithm of the magnitude.

So, the position of the output buffer tells you what frequency was detected. The magnitude (L2 norm) of the complex number tells you how strong the detected frequency was, and the phase (arctangent) gives you information that is a lot more important in image space than audio space. Because the FFT is discrete, the frequencies run from 0 to the nyquist frequency. In images, the first term (DC) is usually the largest, and so a good candidate for use in normalization if that is your aim. I don't know if that is also true for audio (I doubt it)