I have a bunch of audio files and need to split each file on silence using SoX. However, some files have a very noisy background and some don't, so I can't use a single set of parameters to split all of them. I'm trying to figure out how to separate the files by background noise. Here is what I got from sox input1.flac -n stat and sox input2.flac -n stat:
Samples read: 18207744
Length (seconds): 568.992000
Scaled by: 2147483647.0
Maximum amplitude: 0.999969
Minimum amplitude: -1.000000
Midline amplitude: -0.000015
Mean norm: 0.031888
Mean amplitude: -0.000361
RMS amplitude: 0.053763
Maximum delta: 0.858917
Minimum delta: 0.000000
Mean delta: 0.018609
RMS delta: 0.039249
Rough frequency: 1859
Volume adjustment: 1.000
and
Samples read: 198976896
Length (seconds): 6218.028000
Scaled by: 2147483647.0
Maximum amplitude: 0.999969
Minimum amplitude: -1.000000
Midline amplitude: -0.000015
Mean norm: 0.156168
Mean amplitude: -0.000010
RMS amplitude: 0.211787
Maximum delta: 1.999969
Minimum delta: 0.000000
Mean delta: 0.091605
RMS delta: 0.123462
Rough frequency: 1484
Volume adjustment: 1.000
The former does not have a noisy background and the latter does. I suspect I can use the Sample Mean of Max delta because of the big gap.
Can anyone explain the meaning of these stats, or at least point me to where I can find it myself? (I tried looking in the official documentation, but it doesn't explain them.) Many thanks.
I don't know how I've managed to miss stat in the SoX docs all this time; it's right there.
Personally I'd rather use the stats function, whose output I find much more practically useful.
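One thing to keep in mind when comparing the two: stat prints amplitudes on a linear scale where 1.0 is full scale, while stats prints levels in dB. A minimal sketch of the conversion (20 * log10 of the amplitude), applied to the two RMS amplitude values from the question. The name to_db is a hypothetical helper, not part of SoX.

```shell
# to_db is a hypothetical helper, not part of SoX.
# Converts a linear amplitude (1.0 = full scale) to dB: 20 * log10(a).
to_db() {
    awk -v a="$1" 'BEGIN { printf "%.2f\n", 20 * log(a) / log(10) }'
}

to_db 0.053763   # RMS amplitude that stat reported for input1.flac
to_db 0.211787   # RMS amplitude that stat reported for input2.flac
```

The two overall RMS levels come out about 12 dB apart, but an overall level alone cannot tell a noisy recording from one that is simply loud.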
As a measure to differentiate between more and less noisy audio, I'd try the difference between the highest and lowest sound levels. The quietest parts will never be quieter than the background noise alone, so if there is little difference, the audio is either noisy or just loud all the time, like a compressed pop song. You could take the difference between the maximum and minimum RMS values, or between the peak and the minimum RMS. The RMS window length should be kept fairly short, say between 10 and 200 ms. If the audio has fade-in or fade-out sections, those should be trimmed away first, though I didn't include that in the code.
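audio="input1.flac"
width=0.01   # RMS window length in seconds (10 ms)
# Mixes down multi-channel files to mono, so stats prints a single column
stats=$(sox "$audio" -n channels 1 stats -w "$width" 2>&1 |
grep "Pk lev dB\|RMS Pk dB\|RMS Tr dB" |
sed 's/[^0-9.-]*//g')
# The three remaining lines are, in order: Pk lev dB, RMS Pk dB, RMS Tr dB
peak=$(head -n 1 <<< "$stats")
rmsmax=$(head -n 2 <<< "$stats" | tail -n 1)
rmsmin=$(tail -n 1 <<< "$stats")
# All values are in dB, so a plain subtraction gives the level difference
rmsdif=$(bc <<< "scale=3; $rmsmax - $rmsmin")
pkmindif=$(bc <<< "scale=3; $peak - $rmsmin")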
echo "
max RMS: $rmsmax
min RMS: $rmsmin
diff RMS: $rmsdif
peak-min: $pkmindif
"
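To sort a whole directory into noisy and clean files, the same measurement could be wrapped in two small functions and compared against a cutoff. This is only a sketch: the names rms_spread and classify are hypothetical, and the 40 dB threshold is an arbitrary placeholder that would need tuning on a few files you have already judged by ear.

```shell
# Hypothetical cutoff in dB, NOT a recommended value; tune it on known files.
threshold=${threshold:-40}

# Reads the text printed by "sox FILE -n channels 1 stats" on stdin and
# prints RMS Pk dB minus RMS Tr dB (the max-to-min RMS difference).
rms_spread() {
    awk '/RMS Pk dB/ { max = $NF }
         /RMS Tr dB/ { min = $NF }
         END { printf "%.2f\n", max - min }'
}

# Prints "noisy" when the spread ($1) is below the cutoff ($2), else "clean".
classify() {
    awk -v d="$1" -v t="$2" \
        'BEGIN { if (d + 0 < t + 0) print "noisy"; else print "clean" }'
}

# Usage (needs sox installed), kept as a comment so the block has no side effects:
# for f in *.flac; do
#     spread=$(sox "$f" -n channels 1 stats -w 0.01 2>&1 | rms_spread)
#     printf '%s\t%s\t%s\n' "$f" "$spread" "$(classify "$spread" "$threshold")"
# done
```

Each group of files can then be split with its own set of silence parameters.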