I am new to deep learning and am currently using LSTMs for language modeling. I was reading the PyTorch documentation and found it confusing.
If I create a
nn.LSTM(input_size, hidden_size, num_layers)
where hidden_size = 4 and num_layers = 2, I think I will have an architecture something like:
op0 op1 ....
LSTM -> LSTM -> h3
LSTM -> LSTM -> h2
LSTM -> LSTM -> h1
LSTM -> LSTM -> h0
x0 x1 .....
If I do something like
nn.LSTM(input_size, hidden_size, 1)
nn.LSTM(input_size, hidden_size, 1)
I think the network architecture will look exactly like above. Am I wrong? And if yes, what is the difference between these two?
The multi-layer LSTM is better known as stacked LSTM where multiple layers of LSTM are stacked on top of each other.
Your understanding is correct, provided the second LSTM takes hidden_size as its input size. The following two definitions of a stacked LSTM are conceptually the same.
nn.LSTM(input_size, hidden_size, 2)
and
from collections import OrderedDict

nn.Sequential(OrderedDict([
    ('LSTM1', nn.LSTM(input_size, hidden_size, 1)),
    ('LSTM2', nn.LSTM(hidden_size, hidden_size, 1))
]))
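Here, the input is fed into the lowest LSTM layer, the output of the lowest layer is forwarded to the next layer, and so on. Please note that the output size of the lowest LSTM layer, and the input size of every layer above it, is hidden_size. Also note that the nn.Sequential version only illustrates the structure: nn.LSTM returns a tuple (output, (h_n, c_n)), so it cannot be called as written, and you have to pass output to the next layer yourself.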
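To make the equivalence concrete, here is a minimal sketch that copies the weights of a two-layer nn.LSTM into two single-layer LSTMs and compares the outputs. The sizes, the variable names (stacked, lstm1, lstm2) and the tolerance are arbitrary choices for illustration; dropout is left at its default of 0, and inputs use the default (seq_len, batch, input_size) layout.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, seq_len, batch = 3, 4, 5, 2  # illustrative sizes

stacked = nn.LSTM(input_size, hidden_size, num_layers=2)
lstm1 = nn.LSTM(input_size, hidden_size, num_layers=1)
lstm2 = nn.LSTM(hidden_size, hidden_size, num_layers=1)

# Copy layer 0 of the stacked LSTM into lstm1 and layer 1 into lstm2.
with torch.no_grad():
    for name in ("weight_ih", "weight_hh", "bias_ih", "bias_hh"):
        getattr(lstm1, name + "_l0").copy_(getattr(stacked, name + "_l0"))
        getattr(lstm2, name + "_l0").copy_(getattr(stacked, name + "_l1"))

x = torch.randn(seq_len, batch, input_size)

out_stacked, (h_n, c_n) = stacked(x)

# Manual stacking: only the output sequence is passed upward,
# each layer starts from its own zero initial state.
out1, (h1, c1) = lstm1(x)
out2, (h2, c2) = lstm2(out1)

same_output = torch.allclose(out_stacked, out2, atol=1e-5)
same_hidden = (torch.allclose(h_n[0], h1[0], atol=1e-5)
               and torch.allclose(h_n[1], h2[0], atol=1e-5))
print(same_output, same_hidden)
```

Note that out1, the per-time-step output of the lower layer, is available here but is not returned by the two-layer nn.LSTM.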
However, you may have seen people define a stacked LSTM in the following way:
rnns = nn.ModuleList()
for i in range(nlayers):
    input_size = input_size if i == 0 else hidden_size
    rnns.append(nn.LSTM(input_size, hidden_size, 1))
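The reason people sometimes use the above approach is that with the first two approaches you can't get the per-time-step hidden states of each individual layer: output holds the hidden states of the last layer only, while h_n and c_n hold each layer's state only at the final time step. Check out what LSTM returns in PyTorch.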
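As a quick illustration of what nn.LSTM returns, the sketch below prints the shapes for a two-layer LSTM. The sizes are arbitrary, and the input uses the default (seq_len, batch, input_size) layout.

```python
import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size, num_layers = 5, 2, 3, 4, 2  # illustrative sizes

lstm = nn.LSTM(input_size, hidden_size, num_layers)
x = torch.randn(seq_len, batch, input_size)

output, (h_n, c_n) = lstm(x)

# output: hidden state of the LAST layer at every time step
print(output.shape)  # torch.Size([5, 2, 4])
# h_n / c_n: hidden and cell state of EVERY layer at the final time step only
print(h_n.shape)     # torch.Size([2, 2, 4])
print(c_n.shape)     # torch.Size([2, 2, 4])

# The last time step of output is the final hidden state of the top layer.
last_step_matches = torch.allclose(output[-1], h_n[-1])
print(last_step_matches)
```

The first layer's hidden states at the earlier time steps appear in none of these tensors.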
So, if you want the intermediate layers' hidden states, you have to declare each LSTM layer as a separate single-layer LSTM and run them in a loop to mimic the multi-layer LSTM operations. For example:
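import torch.nn.functional as F

outputs = []
for i in range(nlayers):
    if i != 0:
        # dropout between layers; training=True keeps it on even at eval time,
        # so inside a module you would normally pass training=self.training
        sent_variable = F.dropout(sent_variable, p=0.2, training=True)
    output, hidden = rnns[i](sent_variable)  # hidden is the (h_n, c_n) tuple of layer i
    outputs.append(output)
    sent_variable = output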
In the end, outputs will contain the per-time-step hidden states of each individual LSTM layer.
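Putting the pieces together, one possible way to package this is a small module like the sketch below. The class name StackedLSTM, its constructor arguments and the choice to also return each layer's (h_n, c_n) are assumptions made for illustration, not an existing PyTorch API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StackedLSTM(nn.Module):
    """Hypothetical wrapper: single-layer LSTMs stacked by hand."""

    def __init__(self, input_size, hidden_size, nlayers, dropout=0.2):
        super().__init__()
        self.dropout = dropout
        # First layer reads the input; every later layer reads hidden_size features.
        self.rnns = nn.ModuleList([
            nn.LSTM(input_size if i == 0 else hidden_size, hidden_size, 1)
            for i in range(nlayers)
        ])

    def forward(self, x):
        outputs, hiddens = [], []
        for i, rnn in enumerate(self.rnns):
            if i != 0:
                # Dropout between layers, active only in training mode.
                x = F.dropout(x, p=self.dropout, training=self.training)
            x, hidden = rnn(x)
            outputs.append(x)       # per-time-step hidden states of layer i
            hiddens.append(hidden)  # (h_n, c_n) of layer i
        return outputs, hiddens


model = StackedLSTM(input_size=3, hidden_size=4, nlayers=2)
model.eval()  # disables the inter-layer dropout
outputs, hiddens = model(torch.randn(5, 2, 3))  # (seq_len, batch, input_size)
print(len(outputs), outputs[0].shape)
```

Using training=self.training means the dropout follows model.train() and model.eval(), which is usually what you want at inference time.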