Can anyone recommend reliable open source software for transcribing English speech in wav files? The two main programs I've researched are Sphinx and Julius, but I've never been able to get either to work, and the documentation with each on transcribing files is sketchy at best.
I'm developing on 64-bit Ubuntu 10.04, whose repos include sphinx2 and julius, as well as voxforge's julius acoustic model for English. I'm focussing on transcribing files, instead of directly processing sound from a mic, because I've given up on expecting projects like these to work with Ubuntu's sound system. This isn't a knock against Ubuntu, as I can record sound with my mic perfectly using Audacity, but neither system seems able to access my mic, so I'm hoping I can simplify their configuration by just reading from a file.
I first tried Sphinx2, from the Ubuntu package sphinx2-bin. Even though the sample sphinx2-demo seemed to work on transcribing a file, there's virtually no documentation on the configuration, so I'm not sure how I'd customize this to read from an arbitrary wav. The audio file used in the demo is in some undocumented "16k" format, which is indirectly referenced through 2 configuration files. There's a brief blurb describing sphinx2-demo as running sphinx2-batch, but inspecting the script shows it's actually calling sphinx2-continuous. Even worse, the --help docs for each script list about 6 dozen options, and don't mention which are required and which are optional. Overall, the lack of sphinx documentation and the poor quality of the existing documentation are driving me nuts.
I next tried Julius, again from the Ubuntu package, which was surprisingly recent (4.1), considering the version used in Voxforge's quickstart is 3.5. The package seems to include slightly better documentation, and even an example written in Python (/usr/share/doc/julius-voxforge/examples/controlapp). After reading the example's docs, I tried adapting it to read from a file by creating a file filelist.txt containing the text "hello.wav", referring to a file of the same name that contains a recording of someone saying "hello". Placing these in the same directory, I ran:
julius -input file -filelist filelist.txt -C julian.jconf
getting the response:
### read waveform input
Error: adin_file: sampling rate != 16000 (8000)
Error: adin_file: error in parsing wav header at hello.wav
Error: adin_file: failed to read speech data: "hello.wav"
0 files processed
Retrying with absolute filenames for filelist.txt and hello.wav produces the same error.
I also tried the Julius call used in the example, to record directly from a mic:
julius -input mic -C julian.jconf
I called this several times, and the response varied between the error:
Cannot read /dev/dsp
and:
STAT: AD-in thread created
<<< please speak >>>
In the latter case, no matter what I say into the mic, nothing happens. I can't tell if it's still unable to read the mic, or if it's reading something but is simply unable to transcribe the audio.
I'm not sure what to make of this. The errors I'm getting don't leave me with much to go on. Why can't it read a wav? Why can't it read /dev/dsp? Why does it then appear to be able to read /dev/dsp, but not react in any way?
Has anyone else had any success with open source speech recognizers, especially on Linux?
Why can't it read a wav?
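The error message tells you: the file has the wrong sampling rate (8000 Hz) instead of the required one (16000 Hz). The sampling rate is very important for speech recognition software, because the audio has to match the rate the recognizer is configured for.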
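To check a file's sampling rate before handing it to a recognizer, something like the following sketch could be used. It relies only on Python's standard `wave` module; the function names `wav_sample_rate` and `matches_expected_rate` are made up for illustration, and the 16000 Hz default is taken from the Julius error message in the question.

```python
import wave


def wav_sample_rate(path):
    """Return the sampling rate in Hz stored in the WAV header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def matches_expected_rate(path, expected_hz=16000):
    """Return True if the file's sampling rate equals expected_hz.

    The 16000 Hz default is an assumption based on the error message
    quoted in the question; adjust it for your own configuration.
    """
    return wav_sample_rate(path) == expected_hz
```

Going by the error output above, `wav_sample_rate("hello.wav")` should report 8000 for the file in the question, so `matches_expected_rate("hello.wav")` would be false until the file is resampled.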
Why can't it read /dev/dsp?
Recent versions of Ubuntu use the PulseAudio framework instead of OSS. The version of Julius you are trying uses OSS, so you need to install the oss-compatibility package from your distribution to bring OSS support back.
Alternatively, you can try a newer Julius, which has PulseAudio support.
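As a quick sanity check, you could test whether the OSS device node is present at all before starting the recognizer. This is only a rough sketch: `oss_device_available` is a hypothetical helper name, and a positive result only means the node exists and is readable by your user, not that a recognizer will actually receive audio from it.

```python
import os


def oss_device_available(device="/dev/dsp"):
    """Return True if the device node exists and is readable by this user.

    This does not prove that audio capture works; the device may still
    be busy or deliver no usable input.
    """
    return os.path.exists(device) and os.access(device, os.R_OK)
```

If this returns false for `/dev/dsp`, the "Cannot read /dev/dsp" error is expected, and the OSS compatibility layer is the first thing to look at.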
Why does it then appear to be able to read /dev/dsp, but not react in any way?
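Most likely because audio input still doesn't work properly: the device can be opened, but no usable audio reaches the recognizer, so there is nothing to transcribe.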
Has anyone else had any success with open source speech recognizers, especially on Linux?
Sure, check this video as an example of what people do with CMUSphinx:
http://www.youtube.com/watch?v=vfaNLIowSyk
I suggest you revisit the CMUSphinx package, which is a leading open source speech recognition engine. There are loads of documents on the website; you just need to read them. Remember that speech recognition is a complex area where you can get great results, but you also need to invest time in understanding the technology, just like in any other domain.
In short, to transcribe a file with CMUSphinx you need to do the following 3 simple steps:
sox input.wav -r 8000 -c 1 resampled.wav
apt-get install pocketsphinx
pocketsphinx_continuous -samprate 8000 -infile resampled.wav
The result will be printed to standard output. To suppress the logger output, redirect stderr to /dev/null:
pocketsphinx_continuous -samprate 8000 -infile resampled.wav 2> /dev/null
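If you want to call the recognizer from a script instead of the shell, the same invocation can be wrapped roughly as follows. This is a sketch, not a tested integration: it assumes `pocketsphinx_continuous` is installed and on your PATH and that you run Python 3.5 or later, and the names `build_command` and `transcribe` are made up for this example. The only flags used are the `-samprate` and `-infile` options from the commands above.

```python
import subprocess


def build_command(wav_path, sample_rate=8000):
    """Build the pocketsphinx_continuous argument list for one WAV file."""
    return [
        "pocketsphinx_continuous",
        "-samprate", str(sample_rate),
        "-infile", wav_path,
    ]


def transcribe(wav_path, sample_rate=8000):
    """Run the recognizer and return whatever it prints to stdout.

    stderr is discarded, which mirrors the 2> /dev/null redirection.
    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(
        build_command(wav_path, sample_rate),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
        check=True,
    )
    return result.stdout
```

Calling `transcribe("resampled.wav")` would then return the recognizer's standard output as a string, which you can post-process as needed.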