Building a Speech Dataset for LSTM Binary Classification

Nirbhay Tandon · Jan 7, 2016 · Viewed 8.4k times

I'm trying to do binary LSTM classification using Theano. I have gone through the example code; however, I want to build my own.

I have a small set of "Hello" and "Goodbye" recordings. I preprocess them by extracting MFCC features and saving the features to text files. I have 20 speech files (10 per word) and generate one text file per recording, so 20 text files containing MFCC features. Each file is a 13x56 matrix.

My problem now is: how do I use these text files to train the LSTM?

I am relatively new to this. I have gone through some of the literature as well, but I have not found a really good explanation of the concept.

Any simpler way of using LSTMs would also be welcome.

Answer

Nikolay Shmyrev · Jan 8, 2016

There are many existing implementations, for example a TensorFlow implementation and a Kaldi-focused implementation with all the scripts; it is better to check them first.

Theano is too low-level; you might try Keras instead, as described in its tutorial. You can run the tutorial "as is" to understand how things work.

Then you need to prepare a dataset. Turn your data into sequences of frames, and assign an output label to every frame in each sequence.
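As a rough sketch of that preparation step, the snippet below loads MFCC text files into one array of frame sequences. It assumes each file is plain whitespace-separated text (the format np.savetxt writes) with 13 rows of coefficients and 56 columns of frames; the helper name load_mfcc_dataset, the file names, and the word IDs 1 and 2 are made up for illustration, and random numbers stand in for real MFCCs.

```python
import os
import tempfile

import numpy as np


def load_mfcc_dataset(paths_and_labels):
    """Load MFCC text files into X with shape (n_words, n_frames, n_features)."""
    X, y = [], []
    for path, label in paths_and_labels:
        mfcc = np.loadtxt(path)  # assumed layout: 13 coefficients x 56 frames
        X.append(mfcc.T)         # transpose -> 56 frames, each a vector of 13 floats
        y.append(label)
    # np.stack requires every file to have the same number of frames
    return np.stack(X), np.array(y)


# Stand-in data: 20 synthetic 13x56 files, 10 per word (hypothetical names and IDs)
rng = np.random.default_rng(0)
tmpdir = tempfile.mkdtemp()
files = []
for i in range(20):
    word_id = 1 if i < 10 else 2
    path = os.path.join(tmpdir, f"word_{i:02d}.txt")
    np.savetxt(path, rng.normal(size=(13, 56)))
    files.append((path, word_id))

X, y = load_mfcc_dataset(files)
print(X.shape, y.shape)  # (20, 56, 13) (20,)
```

If your recordings have different lengths, the sequences would need padding or truncation to a common number of frames before stacking.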

Keras supports two types of RNN layers: layers that return sequences and layers that return single values. You can experiment with both; in code you just set return_sequences=True or return_sequences=False.

To train with sequences, assign a dummy label to all frames except the last one, which gets the label of the word you want to recognize. Place the inputs and output labels into arrays, like this:

X = [[word1frame1, word1frame2, ..., word1framen], [word2frame1, word2frame2, ..., word2framen]]

Y = [[0, 0, ..., 1], [0, 0, ..., 2]]

In X, every element is a vector of 13 floats. In Y, every element is just a number: 0 for intermediate frames and the word ID for the final frame.
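A minimal sketch of building such a per-frame label array with NumPy; the helper name frame_labels is hypothetical, and 5 frames are used only to keep the printout short:

```python
import numpy as np


def frame_labels(word_ids, n_frames):
    """Per-frame labels: dummy label 0 everywhere, word ID on the final frame."""
    Y = np.zeros((len(word_ids), n_frames), dtype=int)
    Y[:, -1] = word_ids  # only the last frame of each sequence carries the word ID
    return Y


Y = frame_labels([1, 2], n_frames=5)
print(Y)
# [[0 0 0 0 1]
#  [0 0 0 0 2]]
```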

To train with just labels, place the inputs and output labels into arrays as before; the output array is simpler, with one label per word. So the data will be:

X = [[word1frame1, word1frame2, ..., word1framen], [word2frame1, word2frame2, ..., word2framen]]

Y = [[0, 1, 0], [0, 0, 1]]

Note that the output is one-hot encoded (np_utils.to_categorical) so that each label becomes a vector instead of a single number.
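To make the encoding concrete, here is a small NumPy stand-in for what that conversion does: a label k becomes a vector with a 1 at position k and 0 elsewhere. The helper one_hot is a hypothetical equivalent written for illustration, not the Keras function itself.

```python
import numpy as np


def one_hot(labels, n_classes):
    """Turn integer labels into one-hot rows (1 at the label's index, 0 elsewhere)."""
    labels = np.asarray(labels, dtype=int)
    # Row k of the identity matrix is the one-hot vector for label k
    return np.eye(n_classes, dtype="float32")[labels]


encoded = one_hot([0, 1, 2], n_classes=3)
print(encoded)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```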

Then you create the network architecture. The input is 13 floats per frame and the output is a vector. In the middle you might have one fully connected layer followed by one LSTM layer. Do not use layers that are too big; start with small ones.
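One way such an architecture might look in Keras is sketched below. The layer sizes (16 units), the 3-way softmax output (index 0 for the dummy label, 1 and 2 for the two words), the optimizer, and the function name build_model are all assumptions chosen for illustration, not values from the tutorial. The per_frame flag switches between the two RNN modes via return_sequences.

```python
def build_model(n_frames=56, n_features=13, n_classes=3, per_frame=False):
    """Small Dense -> LSTM -> softmax model (illustrative sizes, not tuned)."""
    # Imported here so the sketch can be read and loaded without Keras installed
    from keras import layers, models

    model = models.Sequential()
    model.add(layers.Input(shape=(n_frames, n_features)))
    # Dense acts on the last axis, so it is applied to every 13-float frame
    model.add(layers.Dense(16, activation="relu"))
    # per_frame=True  -> one output per frame (return_sequences=True)
    # per_frame=False -> one output per word  (return_sequences=False)
    model.add(layers.LSTM(16, return_sequences=per_frame))
    model.add(layers.Dense(n_classes, activation="softmax"))
    # categorical_crossentropy expects one-hot labels
    model.compile(loss="categorical_crossentropy",
                  optimizer="adam",
                  metrics=["accuracy"])
    return model
```

With per_frame=False the model produces one class vector per word; with per_frame=True it produces one per frame, so the labels would need to be one-hot encoded per frame as well.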

Then you feed this dataset into model.fit, which trains the model. After training, you can estimate model quality on a held-out set.
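A sketch of setting aside a held-out set before training. The helper stratified_split and the 20% fraction are assumptions; the commented model.fit and model.evaluate calls presume a compiled Keras model named model (compiled with an accuracy metric) and arrays X and Y, none of which are defined here, and the epoch and batch-size values are placeholders.

```python
import numpy as np


def stratified_split(y, heldout_fraction=0.2, seed=0):
    """Return (train_idx, heldout_idx), holding out the same share of each class."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_idx, heldout_idx = [], []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        n_heldout = max(1, int(round(heldout_fraction * len(idx))))
        heldout_idx.extend(idx[:n_heldout])
        train_idx.extend(idx[n_heldout:])
    return np.array(train_idx), np.array(heldout_idx)


y = np.array([1] * 10 + [2] * 10)  # word IDs for 20 recordings
train_idx, heldout_idx = stratified_split(y)
print(len(train_idx), len(heldout_idx))  # 16 4

# Training and scoring would then look roughly like:
# model.fit(X[train_idx], Y[train_idx], epochs=50, batch_size=4)
# loss, accuracy = model.evaluate(X[heldout_idx], Y[heldout_idx])
```

Splitting per class keeps both words represented in the held-out set, which matters when there are only 10 examples of each.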

You will have a problem with convergence since you have just 20 examples. You need many more examples, preferably thousands, to train an LSTM; with this little data you will only be able to use very small models.
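To get a feel for how quickly model size outgrows 20 examples, here is a back-of-the-envelope parameter count for a single LSTM layer. It assumes the standard formulation without peephole connections (four weight blocks, each with input weights, recurrent weights, and a bias); the unit counts are arbitrary examples.

```python
def lstm_param_count(n_inputs, n_units):
    """Trainable parameters in one standard LSTM layer (no peepholes)."""
    # 4 blocks (input, forget, output gates + cell candidate), each with
    # input weights, recurrent weights, and one bias per unit
    return 4 * (n_units * (n_inputs + n_units) + n_units)


for units in (4, 16, 64):
    print(units, lstm_param_count(13, units))
# 4 288
# 16 1920
# 64 19968
```

Even a 4-unit layer on 13-float frames has far more parameters than there are training recordings.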