I have a set of audio files that are uploaded by users, and there is no knowing what they contain.
I would like to take an arbitrary audio file, and extract each of the instances where someone is speaking into separate audio files. I don't want to detect the actual words, just the "started speaking", "stopped speaking" points and generate new files at these points.
(I'm targeting a Linux environment, and developing on a Mac)
I've found SoX, which looks promising, and it has a 'vad' mode (Voice Activity Detection). However, this appears to find only the first instance of speech and strip the audio before that point, so it's close, but not quite right.
I've also looked at Python's 'wave' library, but then I'd need to write my own implementation of Sox's 'vad'.
Are there any command line tools that would do what I want off the shelf? If not, any good Python or Ruby approaches?
For Voice Activity Detection, I have been using the EnergyDetector program of the MISTRAL (formerly LIA_RAL) speaker recognition toolkit, based on the ALIZE library.
It works with feature files, not with audio files, so you'll need to extract the energy of the signal. I usually extract cepstral features (MFCC) with the log-energy parameter, and I use this parameter for VAD. You can use sfbcep, a utility that is part of the SPro signal processing toolkit, in the following way:
sfbcep -F PCM16 -p 19 -e -D -A input.wav output.prm
It will extract 19 MFCC + the log-energy coefficient + first- and second-order delta coefficients. The energy coefficient is at index 19 (counting from 0, after the 19 MFCCs); you specify that index in the EnergyDetector configuration file (featureServerMask).
You will then run EnergyDetector in this way:
EnergyDetector --config cfg/EnergyDetector.cfg --inputFeatureFilename output
If you use the configuration file that you find at the end of the answer, you need to put output.prm in prm/, and you'll find the segmentation in lbl/.
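Once you have the segmentation, cutting the original audio into one file per speech segment can be done with Python's 'wave' module. The following is only a sketch: it assumes each line of the label file has the form "<start seconds> <end seconds> <label>" (check this against the file you actually get in lbl/), and the function names parse_labels and split_wav are made up for this example.

```python
import wave


def parse_labels(text):
    """Parse lines assumed to look like '<start_sec> <end_sec> <label>'."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        segments.append((float(parts[0]), float(parts[1]), parts[2]))
    return segments


def split_wav(wav_path, segments, out_prefix, label="speech"):
    """Write one WAV file per segment carrying the given label.

    Returns the list of files written.
    """
    written = []
    with wave.open(wav_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        for i, (start, end, name) in enumerate(segments):
            if name != label:
                continue
            # Convert seconds to frame indices, clamped to the file length.
            first = min(max(round(start * rate), 0), total)
            last = min(max(round(end * rate), first), total)
            src.setpos(first)
            frames = src.readframes(last - first)
            out_path = f"{out_prefix}_{i:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)  # frame count is fixed up on close
                dst.writeframes(frames)
            written.append(out_path)
    return written
```

For example, split_wav("input.wav", parse_labels(open("lbl/output.lbl").read()), "speech") would write speech_000.wav, speech_001.wav, and so on, assuming the label file is named lbl/output.lbl.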
As a reference, I attach my EnergyDetector configuration file:
*** EnergyDetector Config File
***
loadFeatureFileExtension .prm
minLLK -200
maxLLK 1000
bigEndian false
loadFeatureFileFormat SPRO4
saveFeatureFileFormat SPRO4
saveFeatureFileSPro3DataKind FBCEPSTRA
featureServerBufferSize ALL_FEATURES
featureServerMemAlloc 50000000
featureFilesPath prm/
mixtureFilesPath gmm/
lstPath lst/
labelOutputFrames speech
labelSelectedFrames all
addDefaultLabel true
defaultLabel all
saveLabelFileExtension .lbl
labelFilesPath lbl/
frameLength 0.01
segmentalMode file
nbTrainIt 8
varianceFlooring 0.0001
varianceCeiling 1.5
alpha 0.25
mixtureDistribCount 3
featureServerMask 19
vectSize 1
baggedFrameProbabilityInit 0.1
thresholdMode weight
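If you cannot install the toolkit, the basic idea of energy-based VAD can be approximated in plain Python. This is a much simpler stand-in, not what EnergyDetector does: the configuration above suggests the toolkit fits a mixture model to the energy values (mixtureDistribCount), whereas the sketch below just compares per-frame log-energy against a fixed threshold that you have to choose yourself. The function names and the threshold are assumptions made for illustration; frame_sec mirrors the frameLength of 0.01 used above.

```python
import math


def frame_log_energy(samples, frame_len):
    """Log of the mean squared amplitude for each non-overlapping frame."""
    energies = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        power = sum(s * s for s in frame) / frame_len
        energies.append(math.log(power + 1e-10))  # offset avoids log(0)
    return energies


def speech_segments(energies, threshold, frame_sec=0.01, min_frames=1):
    """Return (start_sec, end_sec) for runs of frames above the threshold."""
    segments = []
    start = None
    # The trailing -inf sentinel closes a run that reaches the end.
    for i, e in enumerate(list(energies) + [float("-inf")]):
        if e > threshold and start is None:
            start = i
        elif e <= threshold and start is not None:
            if i - start >= min_frames:
                segments.append((start * frame_sec, i * frame_sec))
            start = None
    return segments
```

Here samples is a list of integer PCM sample values for one channel, and frame_len is the number of samples per frame (80 for 10 ms at 8 kHz). Raising min_frames discards very short bursts that are unlikely to be speech.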
The CMU Sphinx speech recognition software contains a built-in VAD. It is written in C, and you might be able to hack it to produce a label file for you.
A very recent addition is GStreamer support, which means that you can use its VAD in a GStreamer media pipeline. See "Using PocketSphinx with GStreamer and Python", section "The 'vader' element".
I have also been using a modified version of the AMR1 Codec that outputs a file with a speech/non-speech classification, but I cannot find its sources online, sorry.