How to split video or audio by silent parts

TermiT · Mar 18, 2016

I need to automatically split a video of a speech into words, so that every word ends up in a separate video file. Is there a way to do this?

My plan was to detect the silent parts and use them as word separators. But I couldn't find a tool that does this, and it looks like ffmpeg is not the right tool for it.


Gyan · Mar 18, 2016

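You could first use ffmpeg to detect intervals of silence. In the command below, noise=-30dB sets the volume threshold below which audio counts as silence, and d=0.5 is the minimum duration, in seconds, that a silence must last before it is reported: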

ffmpeg -i "" -af silencedetect=noise=-30dB:d=0.5 -f null - 2> vol.txt

Because stderr is redirected, ffmpeg's console output ends up in vol.txt. It will contain readings that look like this:

[silencedetect @ 00000000004b02c0] silence_start: -0.0306667
[silencedetect @ 00000000004b02c0] silence_end: 1.42767 | silence_duration: 1.45833
[silencedetect @ 00000000004b02c0] silence_start: 2.21583
[silencedetect @ 00000000004b02c0] silence_end: 2.7585 | silence_duration: 0.542667
[silencedetect @ 00000000004b02c0] silence_start: 3.1315
[silencedetect @ 00000000004b02c0] silence_end: 5.21833 | silence_duration: 2.08683
[silencedetect @ 00000000004b02c0] silence_start: 5.3895
[silencedetect @ 00000000004b02c0] silence_end: 7.84883 | silence_duration: 2.45933
[silencedetect @ 00000000004b02c0] silence_start: 8.05117
[silencedetect @ 00000000004b02c0] silence_end: 10.0953 | silence_duration: 2.04417
[silencedetect @ 00000000004b02c0] silence_start: 10.4798
[silencedetect @ 00000000004b02c0] silence_end: 12.4387 | silence_duration: 1.95883
[silencedetect @ 00000000004b02c0] silence_start: 12.6837
[silencedetect @ 00000000004b02c0] silence_end: 14.5572 | silence_duration: 1.8735
[silencedetect @ 00000000004b02c0] silence_start: 14.9843
[silencedetect @ 00000000004b02c0] silence_end: 16.5165 | silence_duration: 1.53217
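
As a rough illustration of how these readings could be picked out of the log, here is a minimal Python sketch. It assumes the log lines have exactly the shape shown above; the names parse_silences, START_RE and END_RE are made up for this example and are not part of ffmpeg.

```python
import re

# Patterns for the two kinds of silencedetect lines shown above.
# Assumption: timestamps are plain decimals, possibly negative.
START_RE = re.compile(r"silence_start:\s*(-?\d+(?:\.\d+)?)")
END_RE = re.compile(r"silence_end:\s*(-?\d+(?:\.\d+)?)")


def parse_silences(log_text):
    """Return a list of (silence_start, silence_end) tuples in seconds."""
    silences = []
    current_start = None
    for line in log_text.splitlines():
        if "silencedetect" not in line:
            continue  # ignore the rest of ffmpeg's log
        match = START_RE.search(line)
        if match:
            current_start = float(match.group(1))
            continue
        match = END_RE.search(line)
        if match and current_start is not None:
            silences.append((current_start, float(match.group(1))))
            current_start = None
    return silences


# The first four readings from the log above.
sample = """\
[silencedetect @ 00000000004b02c0] silence_start: -0.0306667
[silencedetect @ 00000000004b02c0] silence_end: 1.42767 | silence_duration: 1.45833
[silencedetect @ 00000000004b02c0] silence_start: 2.21583
[silencedetect @ 00000000004b02c0] silence_end: 2.7585 | silence_duration: 0.542667
"""
silences = parse_silences(sample)
print(silences)
```

In practice the text would come from the saved log, for example open("vol.txt").read(), rather than from an inline sample.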

You then generate commands that cut from each silence end to the next silence start. You will probably want to add handles of, say, 250 ms on each side, so each clip will be 2 * 250 ms longer than the speech it contains.

ffmpeg -ss <silence_end - 0.25> -t <next_silence_start - silence_end + 2 * 0.25> -i

(I have skipped specifying audio/video parameters)
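
To make the arithmetic concrete, here is a small Python sketch of how the -ss and -t values could be computed for one clip. The function name clip_window is hypothetical, and clamping the seek position at zero is an extra assumption of this sketch (it only matters when the first word starts within 250 ms of the beginning of the file).

```python
HANDLE = 0.25  # seconds of padding on each side, as suggested above


def clip_window(silence_end, next_silence_start, handle=HANDLE):
    """Return (ss, t): the seek position and duration for one clip."""
    # Assumption of this sketch: never seek before the start of the file.
    ss = max(0.0, silence_end - handle)
    # Equals next_silence_start - silence_end + 2 * handle when no clamping occurs.
    t = (next_silence_start + handle) - ss
    return ss, t


# First spoken segment from the log above: silence ends at 1.42767,
# the next silence starts at 2.21583.
ss, t = clip_window(1.42767, 2.21583)
print(f"-ss {ss:.3f} -t {t:.3f}")  # prints "-ss 1.178 -t 1.288"
```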

You'll want to write a script to scrape the log and generate a structured (maybe CSV) file with the timecodes, one pair per line: a silence_end and the next silence_start. Then write another script to generate the commands from each pair of numbers.
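
The two scripts could look roughly like the following Python sketch, shown here as one file for brevity. Everything named in it is an assumption made for illustration: the function names, the input file name input.mp4, the output pattern word_001.mp4, and the 0.25 s handle. As in the command above, audio/video parameters are left out.

```python
import csv
import io
import re
import shlex

# Matches both silence_start and silence_end readings in log order.
VALUE_RE = re.compile(r"silence_(start|end):\s*(-?\d+(?:\.\d+)?)")


def speech_pairs(log_text):
    """Pair each silence_end with the silence_start that follows it."""
    pairs = []
    last_end = None
    for kind, value in VALUE_RE.findall(log_text):
        if kind == "end":
            last_end = float(value)
        elif last_end is not None:
            pairs.append((last_end, float(value)))
            last_end = None
    return pairs


def write_csv(pairs, out):
    """Step 1: write one 'silence_end,next_silence_start' row per clip."""
    csv.writer(out).writerows(pairs)


def build_commands(csv_file, src, handle=0.25):
    """Step 2: turn each CSV row into an ffmpeg command line."""
    commands = []
    for index, (end, start) in enumerate(csv.reader(csv_file), 1):
        ss = max(0.0, float(end) - handle)
        t = float(start) + handle - ss
        # Hypothetical output name; codec options are omitted on purpose.
        args = ["ffmpeg", "-ss", f"{ss:.3f}", "-t", f"{t:.3f}",
                "-i", src, f"word_{index:03d}.mp4"]
        commands.append(shlex.join(args))
    return commands


# A few readings in the format shown above; a real run would read vol.txt.
log = """\
[silencedetect @ 00000000004b02c0] silence_start: -0.0306667
[silencedetect @ 00000000004b02c0] silence_end: 1.42767 | silence_duration: 1.45833
[silencedetect @ 00000000004b02c0] silence_start: 2.21583
[silencedetect @ 00000000004b02c0] silence_end: 2.7585 | silence_duration: 0.542667
[silencedetect @ 00000000004b02c0] silence_start: 3.1315
"""
buffer = io.StringIO(newline="")  # stands in for a CSV file on disk
write_csv(speech_pairs(log), buffer)
buffer.seek(0)
commands = build_commands(buffer, "input.mp4")
for command in commands:
    print(command)
```

The commands are only printed here so they can be inspected before anything is run. Note that a trailing silence_end with no following silence_start is dropped, so speech after the last detected silence would need separate handling.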