Google Cloud Speech: Distinguish Voices?

user800576 · Feb 1, 2017

I am interested in writing a voice recognition application that is aware of multiple speakers. For example, if Bill, Joe, and Jane are talking, the application could not only transcribe the speech as text but also classify the results by speaker (say 0, 1 and 2... because obviously/hopefully Google has no means of linking voices to people).

I am hunting for speech recognition APIs that might do this, and Google Cloud Speech comes up as a top-ranked API. I have looked through the API docs to see if such functionality is available, and have not found it.

My question is: does/will this functionality exist?

Note: Google's support page says their engineers sometimes answer these questions on SO, so it seems plausible someone might have an answer to the "will" part of the question.

Answer

John Saunders · Oct 4, 2017

IBM's Speech to Text service does this. If you use their REST service it's very simple: just request speaker labels with a URL parameter. Documentation for it is here (https://console.bluemix.net/docs/services/speech-to-text/output.html#speaker_labels)

It works roughly like this:

 curl -X POST -u {username}:{password} \
   --header "Content-Type: audio/flac" \
   --data-binary @{path}audio-multi.flac \
   "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize?model=en-US_NarrowbandModel&speaker_labels=true"

It then returns JSON with the results and speaker labels, like this:

{
 "results": [
    {
      "alternatives": [
        {
          "timestamps": [
            [
              "hello",
              0.68,
              1.19
            ],
            [
              "yeah",
              1.47,
              1.93
            ],
            [
              "yeah",
              1.96,
              2.12
            ],
            [
              "how's",
              2.12,
              2.59
            ],
            [
              "Billy",
              2.59,
              3.17
            ],
            . . .
          ],
          "confidence": 0.821,
          "transcript": "hello yeah yeah how's Billy "
        }
      ],
      "final": true
    }
  ],
  "result_index": 0,
  "speaker_labels": [
    {
      "from": 0.68,
      "to": 1.19,
      "speaker": 2,
      "confidence": 0.418,
      "final": false
    },
    {
      "from": 1.47,
      "to": 1.93,
      "speaker": 1,
      "confidence": 0.521,
      "final": false
    },
    {
      "from": 1.96,
      "to": 2.12,
      "speaker": 2,
      "confidence": 0.407,
      "final": false
    },
    {
      "from": 2.12,
      "to": 2.59,
      "speaker": 2,
      "confidence": 0.407,
      "final": false
    },
    {
      "from": 2.59,
      "to": 3.17,
      "speaker": 2,
      "confidence": 0.407,
      "final": false
    },
    . . .
  ]
}
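
To turn that response into "who said what", you can pair each word with a speaker label. Here is a minimal Python sketch of one way to do it. It assumes the "from"/"to" values in speaker_labels match the word timestamps exactly, as they do in the sample above; the function names (words_by_speaker, speaker_turns) are my own and not part of any SDK, and the sample dict is just the response above trimmed to the fields used.

```python
from itertools import groupby


def words_by_speaker(response):
    """Pair each word with a speaker by matching its (start, end) times.

    Assumption: every speaker label shares its from/to values with exactly
    one word timestamp, as in the sample response. Unmatched words get None.
    """
    spans = {(label["from"], label["to"]): label["speaker"]
             for label in response.get("speaker_labels", [])}
    words = []
    for result in response.get("results", []):
        # Use the first (top-ranked) alternative of each result.
        for word, start, end in result["alternatives"][0]["timestamps"]:
            words.append((spans.get((start, end)), word))
    return words


def speaker_turns(words):
    """Collapse consecutive words from the same speaker into one turn."""
    return [(speaker, " ".join(word for _, word in group))
            for speaker, group in groupby(words, key=lambda pair: pair[0])]


# The response shown above, trimmed to the fields this sketch reads.
sample = {
    "results": [{"alternatives": [{"timestamps": [
        ["hello", 0.68, 1.19], ["yeah", 1.47, 1.93], ["yeah", 1.96, 2.12],
        ["how's", 2.12, 2.59], ["Billy", 2.59, 3.17],
    ]}]}],
    "speaker_labels": [
        {"from": 0.68, "to": 1.19, "speaker": 2},
        {"from": 1.47, "to": 1.93, "speaker": 1},
        {"from": 1.96, "to": 2.12, "speaker": 2},
        {"from": 2.12, "to": 2.59, "speaker": 2},
        {"from": 2.59, "to": 3.17, "speaker": 2},
    ],
}

for speaker, text in speaker_turns(words_by_speaker(sample)):
    print(f"speaker {speaker}: {text}")
```

For the sample this prints three turns: speaker 2 says "hello", speaker 1 says "yeah", then speaker 2 says "yeah how's Billy". Note that the labels above are marked "final": false with fairly low confidence, so in real use you may want to wait for final labels before trusting the grouping.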

They also offer a WebSocket interface and SDKs for several platforms that can access this feature, not just REST calls.

Good luck!