Open source code for voice detection and discrimination

Croad Langshan · Apr 22, 2011 · Viewed 25.7k times

I have 15 audio tapes, one of which I believe contains an old recording of my grandmother and myself talking. A quick attempt to find the right place didn't turn it up. I don't want to listen to 20 hours of tape to find it. The location may not be at the start of one of the tapes. Most of the content seems to fall into three categories -- in order of total length, longest first: silence, speech radio, and music.

I plan to convert all of the tapes to digital format, and then look again for the recording. The obvious way is to play them all in the background while I'm doing other things. That's far too straightforward for me, so: Are there any open source libraries, or other code, that would allow me to find, in order of increasing sophistication and usefulness:

  1. Non-silent regions
  2. Regions containing human speech
  3. Regions containing my own speech (and that of my grandmother)

My preference is for Python, Java, or C.

Failing answers, hints about search terms would be appreciated since I know nothing about the field.

I understand that I could easily spend more than 20 hours on this.

Answer

hruske · Jun 9, 2013

What will probably save you the most time is speaker diarization. It annotates the recording with speaker IDs, which you can then manually map to real people with very little effort. Error rates are typically about 10-15% of the recording length, which sounds awful, but this figure includes detecting too many speakers and mapping two IDs to the same person, which isn't hard to mend.
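To illustrate the manual mapping step, here is a minimal Python sketch. It assumes the diarization result has already been parsed into `(start_sec, end_sec, speaker_id)` tuples; that tuple layout, the IDs such as `"S0"`, and the function names are hypothetical and are not the output format of any particular tool. Mapping two IDs to one name and merging neighbouring segments is how the "too many speakers" error gets mended.

```python
from collections import defaultdict


def relabel_and_merge(segments, id_to_name, max_gap=0.5):
    """Replace speaker IDs with names and merge adjacent same-name segments.

    segments   : iterable of (start_sec, end_sec, speaker_id) tuples
                 (hypothetical layout, not a specific tool's format)
    id_to_name : dict built by hand after listening to a few segments
    max_gap    : merge segments of the same person separated by at most
                 this many seconds
    """
    merged = []
    for start, end, speaker_id in sorted(segments):
        # Unmapped IDs keep their original label.
        name = id_to_name.get(speaker_id, speaker_id)
        if merged and merged[-1][2] == name and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), name)
        else:
            merged.append((start, end, name))
    return merged


def total_time(merged):
    """Sum the seconds attributed to each name."""
    totals = defaultdict(float)
    for start, end, name in merged:
        totals[name] += end - start
    return dict(totals)


if __name__ == "__main__":
    segments = [(0.0, 2.0, "S0"), (2.1, 4.0, "S3"),
                (4.0, 6.0, "S1"), (10.0, 12.0, "S0")]
    # "S0" and "S3" turned out to be the same person.
    names = {"S0": "me", "S3": "me", "S1": "grandmother"}
    for start, end, name in relabel_and_merge(segments, names):
        print(f"{start:8.1f} {end:8.1f}  {name}")
```

Once the segments carry real names, jumping to the first span labelled with your grandmother is a matter of reading the printed start times.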

One good tool is the SHoUT toolkit (C++), even though it's a bit picky about input format. See the author's usage instructions for this tool. It outputs voice/speech activity detection metadata and speaker diarization, meaning you get your 1st and 2nd points (VAD/SAD) and a bit extra, since it annotates when the same speaker is active in a recording.
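If you only want the first point (non-silent regions) before setting up a full toolkit, a crude energy threshold can get you started. The sketch below is not how SHoUT works; it is a plain RMS-per-frame detector using only the Python standard library, and it assumes 16-bit mono PCM WAV input. The function names and the `threshold`, `frame_sec`, and `min_sec` defaults are illustrative guesses that you would need to tune for tape hiss.

```python
import array
import math
import sys
import wave


def read_wav_mono16(path):
    """Load a 16-bit mono PCM WAV file into an array of signed samples."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2 or wf.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        rate = wf.getframerate()
        samples = array.array("h")
        samples.frombytes(wf.readframes(wf.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return samples, rate


def frame_rms(frame):
    """Root-mean-square amplitude of one frame of samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def non_silent_regions(samples, rate, frame_sec=0.05,
                       threshold=500.0, min_sec=1.0):
    """Return (start_sec, end_sec) spans whose frame RMS exceeds threshold.

    Spans shorter than min_sec are dropped as clicks or noise bursts.
    """
    frame_len = max(1, int(rate * frame_sec))
    n_frames = (len(samples) + frame_len - 1) // frame_len
    regions = []
    start = None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        loud = frame_rms(frame) > threshold
        t = i * frame_len / rate
        if loud and start is None:
            start = t
        elif not loud and start is not None:
            if t - start >= min_sec:
                regions.append((start, t))
            start = None
    if start is not None:
        end = len(samples) / rate
        if end - start >= min_sec:
            regions.append((start, end))
    return regions
```

This reads a whole file into memory and loops in pure Python, so it is slow on hour-long recordings; it is meant to show the idea, and it cannot tell speech from music or radio, which is where the speech activity detection in the toolkit earns its keep.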

The other useful tool is LIUM spkdiarization (Java), which does basically the same thing, except that I haven't yet put in enough effort to figure out how to get VAD metadata from it. It comes as a convenient, ready-to-use downloadable package.

With a little bit of compiling, this should work in under an hour.