Difference between 1 LSTM with num_layers = 2 and 2 LSTMs in pytorch

user3828311 picture user3828311 · Mar 11, 2018 · Viewed 10.7k times · Source

I am new to deep learning and currently working on using LSTMs for language modeling. I was looking at the pytorch documentation and was confused by it.

If I create a

nn.LSTM(input_size, hidden_size, num_layers) 

where hidden_size = 4 and num_layers = 2, I think I will have an architecture something like:

op0    op1 ....
LSTM -> LSTM -> h3
LSTM -> LSTM -> h2
LSTM -> LSTM -> h1
LSTM -> LSTM -> h0
x0     x1 .....

If I do something like

nn.LSTM(input_size, hidden_size, 1)
nn.LSTM(input_size, hidden_size, 1)

I think the network architecture will look exactly like above. Am I wrong? And if yes, what is the difference between these two?

Answer

Wasi Ahmad picture Wasi Ahmad · Mar 12, 2018

The multi-layer LSTM is better known as stacked LSTM where multiple layers of LSTM are stacked on top of each other.

Your understanding is correct. The following two definitions of stacked LSTM are same.

nn.LSTM(input_size, hidden_size, 2)

and

nn.Sequential(OrderedDict([
    ('LSTM1', nn.LSTM(input_size, hidden_size, 1),
    ('LSTM2', nn.LSTM(hidden_size, hidden_size, 1)
]))

Here, the input is feed into the lowest layer of LSTM and then the output of the lowest layer is forwarded to the next layer and so on so forth. Please note, the output size of the lowest LSTM layer and the rest of the LSTM layer's input size is hidden_size.

However, you may have seen people defined stacked LSTM in the following way:

rnns = nn.ModuleList()
for i in range(nlayers):
    input_size = input_size if i == 0 else hidden_size
    rnns.append(nn.LSTM(input_size, hidden_size, 1))

The reason people sometimes use the above approach is that if you create a stacked LSTM using the first two approaches, you can't get the hidden states of each individual layer. Check out what LSTM returns in PyTorch.

So, if you want to have the intermedia layer's hidden states, you have to declare each individual LSTM layer as a single LSTM and run through a loop to mimic the multi-layer LSTM operations. For example:

outputs = []
for i in range(nlayers):
    if i != 0:
        sent_variable = F.dropout(sent_variable, p=0.2, training=True)
    output, hidden = rnns[i](sent_variable)
    outputs.append(output)
    sent_variable = output

In the end, outputs will contain all the hidden states of each individual LSTM layer.