Comparing two recorded voices

mario.borg · Jan 11, 2015 · Viewed 7.4k times

I need to find some literature on how to compare a voice recorded in real time (from a mic) against a database of pre-recorded voices. After comparing, I would then need to output a match percentage.

I am researching audio fingerprinting, but I can't reach any conclusion from the literature on such an implementation. Is there any expert out here who can guide me in achieving this?

Answer

Aditya · Jun 9, 2016

I have done similar work before, so I may be the right person to describe the procedure to you.

I had clean recordings of sounds which I considered gold standards. I wrote Python scripts to convert these sounds into arrays of MFCC vectors. Read more about MFCCs here.

Extracting MFCCs can be considered the first step in processing an audio file: they are features that are good at identifying the acoustic content. I generated one MFCC vector every 10 ms, each with 39 attributes (commonly 13 coefficients plus their first and second derivatives). So a sound file that was 5 seconds long yielded around 500 MFCC vectors, each with 39 attributes.
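For illustration, here is a minimal sketch of that extraction step in Python, assuming the librosa library; the original answer's scripts are not shown, so the function name and file name below are placeholders.

```python
# Minimal sketch of the MFCC extraction step, assuming librosa is installed.
import librosa
import numpy as np

def extract_mfcc_vectors(path, sr=16000, n_mfcc=13):
    """Return one 39-dimensional MFCC vector per 10 ms frame."""
    y, sr = librosa.load(path, sr=sr)
    hop = int(0.010 * sr)  # 10 ms hop -> roughly 100 vectors per second
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
    delta = librosa.feature.delta(mfcc)             # first derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)   # second derivatives
    # Stack 13 + 13 + 13 = 39 attributes; one row per frame
    return np.vstack([mfcc, delta, delta2]).T

# A 5-second file yields roughly 500 rows, i.e. an array of shape (~500, 39):
# X = extract_mfcc_vectors("siren.wav")  # hypothetical file name
```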

Then I wrote artificial neural network code along these lines. More about neural networks can be read here.

I then trained the neural network's weights and biases (commonly known as the network parameters) with the stochastic gradient descent algorithm, using backpropagation to compute the gradients. The trained model was then saved to identify unknown sounds.
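As a rough sketch of this training step, one could use scikit-learn's MLPClassifier with an SGD solver in place of the answer's hand-written network; the hidden layer size, learning rate, and file names below are assumptions, since the original code is not shown.

```python
# Hedged sketch of the training step using scikit-learn instead of a
# hand-written network; architecture and hyperparameters are guesses.
import numpy as np
from sklearn.neural_network import MLPClassifier
from joblib import dump

# X_train: (n_frames, 39) MFCC vectors from all gold-standard recordings
# y_train: (n_frames,) integer class label for the file each frame came from
X_train = np.load("mfcc_frames.npy")    # placeholder file names
y_train = np.load("frame_labels.npy")

clf = MLPClassifier(
    hidden_layer_sizes=(64,),   # assumed size; the original is unspecified
    solver="sgd",               # stochastic gradient descent, as in the answer
    learning_rate_init=0.01,
    max_iter=200,
)
clf.fit(X_train, y_train)       # backpropagation computes the gradients

dump(clf, "sound_classifier.joblib")  # save the trained model for later use
```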

The new sounds were then represented as sequences of MFCC vectors and given as input to the neural network. For each MFCC instance obtained from the new sound file, the network predicts one of the sound classes it was trained on. The number of correctly classified MFCC instances gives the accuracy with which the neural network was able to classify the unknown sound.
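A sketch of this scoring step, reusing the hypothetical model and extraction helper from the snippets above, could look like this:

```python
# Classify an unknown sound frame by frame and report the per-class
# match percentage, following the voting procedure described above.
import numpy as np
from joblib import load

clf = load("sound_classifier.joblib")
X_new = extract_mfcc_vectors("unknown.wav")   # ~500 frames for a 5 s file

frame_labels = clf.predict(X_new)             # one class per MFCC instance
classes, counts = np.unique(frame_labels, return_counts=True)
for cls, count in zip(classes, counts):
    print(f"class {cls}: {100.0 * count / len(frame_labels):.1f}%")
# e.g. 440 of 500 frames voting "siren" would give an 88% match
```

Counting the per-class votes this way directly yields the match percentage that the question asks for.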

Consider an example: you train your neural network on 4 types of sounds, 1. whistle, 2. car horn, 3. dog bark, and 4. siren, using the procedure described above.

Say the new sound is a siren that is 5 s long. You will obtain approximately 500 MFCC instances. The trained neural network will try to classify each MFCC instance into one of the classes it was trained on, so you may get something like this:

30 instances were classified as whistle, 20 instances were classified as car horn, 10 instances were classified as dog bark, and the remaining 440 instances were correctly classified as siren.

The accuracy of classification, or rather the similarity between the sounds, can be approximately calculated as the ratio of the number of correctly classified instances to the total number of instances, which in this case is 440 / 500, i.e. 88%. This field is relatively new, and much prior work uses similar machine learning algorithms such as Hidden Markov Models, Support Vector Machines, and more.

This problem has been tackled before, and you can find research papers about it on Google Scholar.