Understanding Keras LSTMs: Role of Batch-size and Statefulness

ascripter picture ascripter · Jan 28, 2018 · Viewed 8.5k times · Source

Sources

There are several sources out there explaining stateful / stateless LSTMs and the role of batch_size which I've read already. I'll refer to them later in my post:

[1] https://machinelearningmastery.com/understanding-stateful-lstm-recurrent-neural-networks-python-keras/

[2] https://machinelearningmastery.com/stateful-stateless-lstm-time-series-forecasting-python/

[3] http://philipperemy.github.io/keras-stateful-lstm/

[4] https://machinelearningmastery.com/use-different-batch-sizes-training-predicting-python-keras/

And also other SO threads like Understanding Keras LSTMs and Keras - stateful vs stateless LSTMs which didn't fully explain what I'm looking for however.


My Problem

I am still not sure what is the correct approach for my task regarding statefulness and determining batch_size.

I have about 1000 independent time series (samples) that have a length of about 600 days (timesteps) each (actually variable length, but I thought about trimming the data to a constant timeframe) with 8 features (or input_dim) for each timestep (some of the features are identical to every sample, some individual per sample).

Input shape = (1000, 600, 8)

One of the features is the one I want to predict, while the others are (supposed to be) supportive for the prediction of this one “master feature”. I will do that for each of the 1000 time series. What would be the best strategy to model this problem?

Output shape = (1000, 600, 1)


What is a Batch?

From [4]:

Keras uses fast symbolic mathematical libraries as a backend, such as TensorFlow and Theano.

A downside of using these libraries is that the shape and size of your data must be defined once up front and held constant regardless of whether you are training your network or making predictions.

[…]

This does become a problem when you wish to make fewer predictions than the batch size. For example, you may get the best results with a large batch size, but are required to make predictions for one observation at a time on something like a time series or sequence problem.

This sounds to me like a “batch” would be splitting the data along the timesteps-dimension.

However, [3] states that:

Said differently, whenever you train or test your LSTM, you first have to build your input matrix X of shape nb_samples, timesteps, input_dim where your batch size divides nb_samples. For instance, if nb_samples=1024 and batch_size=64, it means that your model will receive blocks of 64 samples, compute each output (whatever the number of timesteps is for every sample), average the gradients and propagate it to update the parameters vector.

When looking deeper into the examples of [1] and [4], Jason is always splitting his time series to several samples that only contain 1 timestep (the predecessor that in his example fully determines the next element in the sequence). So I think the batches are really split along the samples-axis. (However his approach of time series splitting doesn’t make sense to me for a long-term dependency problem.)

Conclusion

So let’s say I pick batch_size=10, that means during one epoch the weights are updated 1000 / 10 = 100 times with 10 randomly picked, complete time series containing 600 x 8 values, and when I later want to make predictions with the model, I’ll always have to feed it batches of 10 complete time series (or use solution 3 from [4], copying the weights to a new model with different batch_size).

Principles of batch_size understood – however still not knowing what would be a good value for batch_size. and how to determine it


Statefulness

The KERAS documentation tells us

You can set RNN layers to be 'stateful', which means that the states computed for the samples in one batch will be reused as initial states for the samples in the next batch.

If I’m splitting my time series into several samples (like in the examples of [1] and [4]) so that the dependencies I’d like to model span across several batches, or the batch-spanning samples are otherwise correlated with each other, I may need a stateful net, otherwise not. Is that a correct and complete conclusion?

So for my problem I suppose I won’t need a stateful net. I’d build my training data as a 3D array of the shape (samples, timesteps, features) and then call model.fit with a batch_size yet to determine. Sample code could look like:

model = Sequential()
model.add(LSTM(32, input_shape=(600, 8)))   # (timesteps, features)
model.add(LSTM(32))
model.add(LSTM(32))
model.add(LSTM(32))
model.add(Dense(1, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X, y, epochs=500, batch_size=batch_size, verbose=2)

Answer

lsmor picture lsmor · Jan 29, 2018

Let me explain it via an example:

So let's say you have the following series: 1,2,3,4,5,6,...,100. You have to decide how many timesteps your lstm will learn, and reshape your data as so. Like below:

if you decide time_steps = 5, you have to reshape your time series as a matrix of samples in this way:

1,2,3,4,5 -> sample1

2,3,4,5,6 -> sample2

3,4,5,6,7 -> sample3

etc...

By doing so, you will end with a matrix of shape (96 samples x 5 timesteps)

This matrix should be reshape as (96 x 5 x 1) indicating Keras that you have just 1 time series. If you have more time series in parallel (as in your case), you do the same operation on each time series, so you will end with n matrices (one for each time series) each of shape (96 sample x 5 timesteps).

For the sake of argument, let's say you 3 time series. You should concat all of three matrices into one single tensor of shape (96 samples x 5 timeSteps x 3 timeSeries). The first layer of your lstm for this example would be:

    model = Sequential()
    model.add(LSTM(32, input_shape=(5, 3)))

The 32 as first parameter is totally up to you. It means that at each point in time, your 3 time series will become 32 different variables as output space. It is easier to think each time step as a fully conected layer with 3 inputs and 32 outputs but with a different computation than FC layers.

If you are about stacking multiple lstm layers, use return_sequences=True parameter, so the layer will output the whole predicted sequence rather than just the last value.

your target shoud be the next value in the series you want to predict.

Putting all together, let say you have the following time series:

Time series 1 (master): 1,2,3,4,5,6,..., 100

Time series 2 (support): 2,4,6,8,10,12,..., 200

Time series 3 (support): 3,6,9,12,15,18,..., 300

Create the input and target tensor

x     -> y

1,2,3,4,5 -> 6

2,3,4,5,6 -> 7

3,4,5,6,7 -> 8

reformat the rest of time series, but forget about the target since you don't want to predict those series

Create your model

    model = Sequential()
    model.add(LSTM(32, input_shape=(5, 3), return_sequences=True)) # Input is shape (5 timesteps x 3 timeseries), output is shape (5 timesteps x 32 variables) because return_sequences  = True
    model.add(LSTM(8))  # output is shape (1 timesteps x 8 variables) because return_sequences = False
    model.add(Dense(1, activation='linear')) # output is (1 timestep x 1 output unit on dense layer). It is compare to target variable.

Compile it and train. A good batch size is 32. Batch size is the size your sample matrices are splited for faster computation. Just don't use statefull