I'm using FMOD to develop an application that, when the user clicks a Next/Prev button, should immediately start playing the next/previous sentence from exactly its beginning in an MP3 file containing speech without music. I got the PCM data of the MP3 file by calling Sound::lock, but Sound::getFormat only told me it was "16-bit integer PCM data", without saying whether it is signed or unsigned. How can I find that out?
Some articles on the Internet say that almost all 16-bit integer PCM data is signed. If my PCM data is signed, which range of values represents silence: the values close to 0 (e.g. -10 to 10), or the values close to -32768 (e.g. -32768 to -32750)? If it's the values close to 0, does that mean there's no difference in meaning between opposite numbers like -32767 and 32767?
I need to detect silences that are long enough (e.g. longer than 500ms) to determine where each sentence in the speech begins.
Could anyone give me any suggestions on how to detect silence between sentences?
16-bit audio is, by convention, usually signed.
Think about what PCM audio is: each sample says how far along its axis the speaker cone should physically sit at that moment in time. Perfect silence is therefore any constant repeated value — that represents the speaker not moving.
0 is then the centre of the range, and usually where a microphone with no input should sit. -32768 is the speaker as close to one end of its travel as it can be; 32767 is it at the other end.
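Assuming the data is indeed signed (you can sanity-check by eyeballing the values: quiet speech should hover around 0, not around 32768), here's a minimal sketch of turning the buffer that Sound::lock hands back into normalised doubles for the maths below. The function name and parameters are mine, not FMOD's, and I'm assuming mono; interleaved stereo would need de-interleaving first.

#include <stdint.h>
#include <stdlib.h>

/* Reinterpret raw 16-bit signed PCM (e.g. the ptr/len pair from
   Sound::lock) as doubles in [-1.0, 1.0), where 0.0 is the speaker
   at rest. Caller is responsible for free()ing the result. */
double *pcm16ToDoubles(const void *lockedPtr, unsigned int lockedLen,
                       size_t *outCount) {
    const int16_t *samples = (const int16_t *)lockedPtr;
    size_t count = lockedLen / sizeof(int16_t);
    double *wave = (double *)malloc(count * sizeof(double));
    for (size_t i = 0; i < count; i++)
        wave[i] = samples[i] / 32768.0;
    *outCount = count;
    return wave;
}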
The safest way to detect silence would be to run a spectral analysis over the relevant range and look for periods where there is no activity in any audible frequency range.
If you're looking for pauses between speech then the easiest thing would probably be to use an online FIR filter design tool: plug in an acceptable frequency range for speech (around 300Hz to 3500Hz is the convention in telephony), your sampling rate, and however many multiplications you think you can afford, then copy the coefficients it supplies. E.g. assuming 37 taps across the speech range with a 44100Hz input, converted to a C array I got:
double coefficients[] = {
    -0.000560, -0.001290, -0.002332, -0.003606, -0.004911, -0.005921, -0.006201,
    -0.005256, -0.002610, 0.002106, 0.009059, 0.018139, 0.028924, 0.040691, 0.052479,
    0.063203, 0.071794, 0.077351, 0.079274, 0.077351, 0.071794, 0.063203, 0.052479,
    0.040691, 0.028924, 0.018139, 0.009059, 0.002106, -0.002610, -0.005256, -0.006201,
    -0.005921, -0.004911, -0.003606, -0.002332, -0.001290, -0.000560};
If the input were doubles, then for each input sample index c I'd compute a sampled value:
double *inputWave = ... input, an infinite array for the purposes of the example ...
size_t numberOfTaps = sizeof(coefficients) / sizeof(coefficients[0]);
/* i.e. the number of coefficients: 37 with the array given above */

double sampledValue = 0.0;
for (size_t coeff = 0; coeff < numberOfTaps; coeff++) {
    sampledValue += coefficients[coeff] * inputWave[c + coeff];
}
What I've then got is a bandpass filter. Only that part of the signal representing sound in the frequency range 300–3500Hz should remain in the output values. In real life no such filter is perfect; increase the number of coefficients to increase the quality of your filter.
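In practice the input isn't infinite, of course; running the filter over a finite buffer just means stopping numberOfTaps - 1 samples before the end, since each output reads that far ahead. A sketch, assuming filteredWave has been allocated to hold the (slightly shorter) output:

/* Apply the FIR filter across a finite buffer of sampleCount samples.
   The output is numberOfTaps - 1 samples shorter than the input. */
for (size_t c = 0; c + numberOfTaps <= sampleCount; c++) {
    double sampledValue = 0.0;
    for (size_t coeff = 0; coeff < numberOfTaps; coeff++)
        sampledValue += coefficients[coeff] * inputWave[c + coeff];
    filteredWave[c] = sampledValue;  /* band-limited to roughly 300-3500Hz */
}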
Having cut the irrelevant parts of the signal, I could then look for prolonged periods of sampledValue being [close to] 0.0.
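"Prolonged" just means counting consecutive near-zero samples: at 44100Hz, 500ms is 22050 samples. As a sketch of that final step — the threshold is a guess of mine that you'd tune by inspecting your actual material:

#include <math.h>
#include <stddef.h>

/* Return the index of the first sample after a quiet run of at least
   minRunSamples samples — i.e. where the next sentence begins — or
   (size_t)-1 if no sufficiently long silence is found.
   For 500ms at 44100Hz, pass minRunSamples = 22050. */
size_t findSentenceStart(const double *filteredWave, size_t count,
                         double threshold, size_t minRunSamples) {
    size_t runLength = 0;
    for (size_t i = 0; i < count; i++) {
        if (fabs(filteredWave[i]) < threshold) {
            runLength++;          /* still inside a quiet stretch */
        } else {
            if (runLength >= minRunSamples)
                return i;         /* a long silence just ended here */
            runLength = 0;        /* too short to count; start over */
        }
    }
    return (size_t)-1;
}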