I need some clarity on how to correctly prepare inputs for batch-training using different components of the torch.nn
module. Specifically, I'm looking to create an encoder-decoder network for a seq2seq model.
Suppose I have a module with these three layers, in order:
nn.Embedding
nn.LSTM
nn.Linear
nn.Embedding
Input: batch_size * seq_length
Output: batch_size * seq_length * embedding_dimension
I don't have any problems here; I just want to be explicit about the expected shapes of the input and output.
nn.LSTM
Input: seq_length * batch_size * input_size (embedding_dimension in this case)
Output: seq_length * batch_size * hidden_size
last_hidden_state: batch_size * hidden_size
last_cell_state: batch_size * hidden_size
To use the output of the Embedding layer as input for the LSTM layer, I need to transpose axes 1 and 2.
Many examples I've found online do something like x = embeds.view(len(sentence), self.batch_size, -1), but that confuses me. How does this view ensure that elements of the same batch remain in the same batch? What happens when len(sentence) and self.batch_size are the same size?
nn.Linear
Input: batch_size * input_size (hidden_size of LSTM in this case, or ??)
Output: batch_size * output_size
If I only need the last_hidden_state of the LSTM, then I can give it as input to nn.Linear. But if I want to make use of Output (which contains all the intermediate hidden states as well), then I need to change nn.Linear's input size to seq_length * hidden_size, and to use Output as input to the Linear module I need to transpose axes 1 and 2 of Output, after which I can view it with Output_transposed.view(batch_size, -1).
Is my understanding here correct? How do I carry out these transpose operations on tensors (tensor.transpose(0, 1))?
Your understanding of most of the concepts is accurate, but there are a few missing points here and there.
You have the embedding output in the shape (batch_size, seq_len, embedding_size). Now, there are various ways you can pass this to the LSTM.
* You can pass this directly to the LSTM, if the LSTM accepts its input as batch-first. So, while creating your LSTM, pass the argument batch_first=True.
* Or, you can pass the input in the shape (seq_len, batch_size, embedding_size). To convert your embedding output to this shape, you'll need to transpose the first and second dimensions using torch.transpose(tensor_name, 0, 1), as you mentioned.
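To make both options concrete, here is a minimal sketch; the layer sizes (vocab_size, lstm_size, etc.) are made up for illustration, not taken from the question:

import torch
import torch.nn as nn

batch_size, seq_len, vocab_size, embedding_size, lstm_size = 4, 10, 100, 8, 16
tokens = torch.randint(0, vocab_size, (batch_size, seq_len))

embedding = nn.Embedding(vocab_size, embedding_size)
embeds = embedding(tokens)             # (batch_size, seq_len, embedding_size)

# Option 1: a batch-first LSTM consumes the embedding output as-is.
lstm_bf = nn.LSTM(embedding_size, lstm_size, batch_first=True)
out_bf, _ = lstm_bf(embeds)            # (batch_size, seq_len, lstm_size)

# Option 2: transpose to (seq_len, batch_size, embedding_size) first.
lstm = nn.LSTM(embedding_size, lstm_size)
out, _ = lstm(embeds.transpose(0, 1))  # (seq_len, batch_size, lstm_size)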
Q. I see many examples online which do something like x = embeds.view(len(sentence), self.batch_size, -1), which confuses me.
A. This is wrong. It will mix up the batches and you will be trying to learn a hopeless task. Wherever you see this, you can tell the author to change the statement and use transpose instead.
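A tiny example makes the difference visible; the toy tensor below uses 1-dimensional embeddings purely for readability:

import torch

# Two sequences of length 3: batch 0 is [0, 1, 2], batch 1 is [10, 11, 12].
embeds = torch.tensor([[[0.], [1.], [2.]],
                       [[10.], [11.], [12.]]])  # (batch=2, seq=3, emb=1)

ok = embeds.transpose(0, 1)  # (seq, batch, emb): time-major, batches intact
bad = embeds.view(3, 2, -1)  # same shape, but the two batches are mixed up

print(ok[0].squeeze())   # tensor([ 0., 10.]) -> first step of each sequence
print(bad[0].squeeze())  # tensor([0., 1.])   -> two steps of the SAME sequence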
There is an argument in favor of not using batch_first, which states that the underlying API provided by Nvidia CUDA runs considerably faster with the batch as the second dimension.
If you are directly feeding the embedding output to the LSTM, this fixes the input size of the LSTM to a context size of 1. This means that if your input is words to the LSTM, you will always be giving it one word at a time. But this is not what we want all the time, so you need to expand the context size. This can be done as follows:
# Assuming that embeds is the embedding output and context_size is a defined variable
embeds = embeds.unfold(1, context_size, 1)  # (batch, seq_len - context_size + 1, embedding_size, context_size); step size 1
embeds = embeds.contiguous().view(embeds.size(0), embeds.size(1), -1)  # flatten each window; unfold returns a non-contiguous view, so view needs contiguous() first
Unfold documentation
Now, you can proceed as mentioned above to feed this to the LSTM; just remember that seq_len has now changed to seq_len - context_size + 1 and embedding_size (which is the input size of the LSTM) has now changed to context_size * embedding_size.
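A quick shape check under made-up sizes confirms the arithmetic:

import torch

batch_size, seq_len, embedding_size, context_size = 4, 10, 8, 3
embeds = torch.randn(batch_size, seq_len, embedding_size)

embeds = embeds.unfold(1, context_size, 1)
embeds = embeds.contiguous().view(embeds.size(0), embeds.size(1), -1)
print(embeds.shape)  # torch.Size([4, 8, 24]) == (batch, 10 - 3 + 1, 3 * 8)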
The lengths of different instances in a batch will not always be the same. For example, some of your sentences might be 10 words long, some might be 15, and some might be 1000. So you definitely want variable-length sequence input to your recurrent unit. To do this, some additional steps need to be performed before you can feed your input to the network. You can follow these steps:
1. Sort your batch from the largest sequence to the smallest.
2. Create a seq_lengths array that defines the length of each sequence in the batch. (This can be a simple Python list.)
3. Pad all the sequences to be of equal length to the largest sequence.
4. Create a LongTensor Variable of this batch. (Steps 1-4 are sketched in code after the packing snippet below.)
5. Now, after passing the above variable through the embedding layer and creating the proper context-size input, you'll need to pack your sequence as follows:
# Assuming embeds to be the proper input to the LSTM, already in
# (seq_len, batch_size, features) layout since batch_first=False below
lstm_input = nn.utils.rnn.pack_padded_sequence(embeds, [x - context_size + 1 for x in seq_lengths], batch_first=False)
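For reference, here is a minimal sketch of steps 1-4; the token ids and the padding id 0 are made up for illustration:

import torch

# Hypothetical raw batch: lists of token ids with varying lengths.
batch = [[4, 7, 1], [2, 9, 3, 5, 8], [6, 2]]

batch.sort(key=len, reverse=True)                  # 1. longest first
seq_lengths = [len(seq) for seq in batch]          # 2. [5, 3, 2]
max_len = seq_lengths[0]
padded = [seq + [0] * (max_len - len(seq)) for seq in batch]  # 3. pad with 0s
batch_tensor = torch.LongTensor(padded)            # 4. (batch_size, max_seq_len)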
Now, once you have prepared your lstm_input according to your needs, you can call the LSTM as
lstm_outs, (h_t, h_c) = lstm(lstm_input, (h_t, h_c))
Here, (h_t, h_c) needs to be provided as the initial hidden state, and the LSTM will output the final hidden state. You can see why packing variable-length sequences is required; otherwise the LSTM will run over the unnecessary padded words as well.
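If you have no better initial state, zeros are the usual choice. Note the leading num_layers * num_directions dimension (1 for a single-layer, unidirectional LSTM); in PyTorch you can also omit the initial state entirely and it defaults to zeros. The sizes below are illustrative:

import torch

num_layers, batch_size, lstm_size = 1, 4, 16
h_t = torch.zeros(num_layers, batch_size, lstm_size)  # initial hidden state
h_c = torch.zeros(num_layers, batch_size, lstm_size)  # initial cell state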
Now, lstm_outs will be a packed sequence containing the output of the LSTM at every step, and (h_t, h_c) are the final hidden state and the final cell state respectively. h_t and h_c will be of shape (num_layers * num_directions, batch_size, lstm_size). You can use these directly for further input, but if you want to use the intermediate outputs as well, you'll need to unpack lstm_outs first, as below:
lstm_outs, _ = nn.utils.rnn.pad_packed_sequence(lstm_outs)  # the ignored second value holds the sequence lengths
Now, your lstm_outs will be of shape (max_seq_len - context_size + 1, batch_size, lstm_size), and you can extract the intermediate outputs of the LSTM according to your need.
Remember that the unpacked output will have 0s after the actual length of each sequence; this is just padding to match the length of the largest sequence (which is always the first one, as we sorted the input from largest to smallest).
Also note that h_t will always be equal to the last non-padded element of each sequence's output.
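You can check (or exploit) that fact by gathering the last valid output of each sequence from the unpacked tensor. This is a sketch assuming a single-layer, unidirectional LSTM; the lengths are the packed lengths from the earlier snippet:

import torch

# lstm_outs: (max_seq_len, batch_size, lstm_size), zero-padded per sequence
lengths = torch.tensor([x - context_size + 1 for x in seq_lengths])
idx = (lengths - 1).view(1, -1, 1).expand(1, lstm_outs.size(1), lstm_outs.size(2))
last_outs = lstm_outs.gather(0, idx).squeeze(0)  # (batch_size, lstm_size)
# last_outs should equal h_t.squeeze(0) in this setup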
Now, if you want to use just the final output of the LSTM, you can directly feed h_t to your linear layer and it will work. But if you want to use the intermediate outputs as well, then you'll need to figure out how you are going to input them to the linear layer (through some attention network or some pooling). You do not want to input the complete sequence to the linear layer, as different sequences will be of different lengths and you can't fix the input size of the linear layer. And yes, you'll need to transpose the output of the LSTM to use it further (again, you cannot use view here).
Ending Note: I have purposely left out some points, such as using bidirectional recurrent cells, using a step size in unfold, and interfacing attention, as they can get quite cumbersome and would be beyond the scope of this answer.