Using Keras for video prediction (time series)

Isa · Mar 6, 2017 · Viewed 9.3k times

I want to predict the next frame of a (greyscale) video given N previous frames, using CNNs or RNNs in Keras. Most tutorials and other material on time series prediction with Keras use a 1-dimensional input, but mine would be 3D (N frames x rows x cols).

I'm currently really unsure what a good approach for this problem would be. My ideas include:

  • Using one or more LSTM layers. The problem here is that I'm not sure whether they're suited to take a series of images instead of a series of scalars as input. Wouldn't the memory consumption explode? If it is okay to use them: how can I use them in Keras for higher dimensions?

  • Using 3D convolution on the input (the stack of previous video frames). This raises other questions: Why would this help when I'm not doing a classification but a prediction? How can I stack the layers in such a way that the input of the network has dimensions (N x cols x rows) and the output (1 x cols x rows)?

I'm pretty new to CNNs/RNNs and Keras and would appreciate any hint in the right direction.

Answer

Marcin Możejko · Mar 6, 2017

So basically every approach has its advantages and disadvantages. Let's go through the ones you provided and then some others to find the best approach:

  1. LSTM: One of their biggest advantages is the ability to learn long-term dependency patterns in your data. They were designed to analyse long sequences such as speech or text. This can also cause problems, because the number of parameters can be really high. Other typical recurrent architectures like GRU might overcome this issue. The main disadvantage is that in their standard (sequential) implementation it is infeasible to fit them on video data, for the same reason that dense layers are bad for image data: loads of temporal and spatial invariances must be learnt by a topology which is not at all suited to capturing them efficiently. Shifting a video by one pixel to the right might completely change the output of your network.

    Another thing worth mentioning is that training an LSTM is believed to be similar to finding an equilibrium between two rival processes: finding good weights for the dense-like output computations and finding good inner-memory dynamics for processing sequences. Finding this equilibrium might take a really long time, but once it is found it is usually quite stable and produces really good results.

  2. Conv3D: One of their biggest advantages is the ability to capture spatial and temporal invariances in the same manner as Conv2D does for images. This makes the curse of dimensionality much less harmful. On the other hand, just as Conv1D might not produce good results with longer sequences, the lack of any memory might make learning a long sequence harder.
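To put rough numbers on the parameter argument in the two items above, here is a small back-of-the-envelope sketch. The frame size (64 x 64), the 256 LSTM units and the 32 filters are illustrative assumptions, not values from the question. The formulas are the usual ones for a standard LSTM layer (four gates, each with input weights, recurrent weights and a bias) and for a convolution with bias.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and one bias per unit.
    return 4 * units * (input_dim + units + 1)


def conv3d_params(kernel, in_channels, filters):
    # One kernel of size kt*kh*kw*in_channels per filter, plus one bias per filter.
    kt, kh, kw = kernel
    return filters * (kt * kh * kw * in_channels + 1)


rows, cols = 64, 64  # assumed frame size

# LSTM fed one flattened greyscale frame per time step.
lstm_count = lstm_params(input_dim=rows * cols, units=256)

# A single Conv3D layer with a 3x3x3 kernel on one-channel video.
conv_count = conv3d_params(kernel=(3, 3, 3), in_channels=1, filters=32)

print(lstm_count, conv_count)
```

The LSTM count grows linearly with the number of pixels per frame, while the convolution count does not depend on the frame size at all.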

Of course one may use different approaches like:

  1. TimeDistributed + Conv2D: using a TimeDistributed wrapper, one may apply a pretrained convnet such as Inception frame by frame and then analyse the feature maps sequentially. A really huge advantage of this approach is the possibility of transfer learning. As a disadvantage, one may think of it as Conv2.5D: it lacks temporal analysis of your data.

  2. ConvLSTM: this architecture is not yet supported by the newest version of Keras (as of March 6th, 2017), but as one may see here it should be provided in the future. It is a mixture of LSTM and Conv2D and is believed to be better than stacking Conv2D and LSTM.
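Keras versions released after this answer do ship a `ConvLSTM2D` layer, so the second item can be tried directly. Below is a minimal sketch of a next-frame model built on it. The frame count, frame size, filter count and the sigmoid output (which assumes pixel values scaled to [0, 1]) are my assumptions for illustration, not part of the original answer.

```python
import keras
from keras import layers

n_frames, rows, cols = 4, 16, 16  # assumed sizes, kept small for a quick check

# Greyscale video: (frames, rows, cols, channels=1).
inputs = keras.Input(shape=(n_frames, rows, cols, 1))

# Convolutional recurrence over the frames; only the last state is returned.
x = layers.ConvLSTM2D(
    filters=8, kernel_size=(3, 3), padding="same", return_sequences=False
)(inputs)

# Map the final hidden state to a single one-channel frame.
outputs = layers.Conv2D(
    filters=1, kernel_size=(3, 3), padding="same", activation="sigmoid"
)(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
```

Training would then pair each window of `n_frames` frames with the frame that follows it as the target.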

Of course these are not the only ways to solve this problem. I'll mention one more which might be useful:

  1. Stacking: one may easily stack the methods above to build a final solution. E.g. one may build a network where the video is first transformed using TimeDistributed(ResNet), then the output is fed to Conv3D with multiple, aggressive spatial pooling steps, and finally transformed by a GRU/LSTM layer.
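The stacking idea could be sketched roughly as follows. To keep the example small and free of weight downloads, a single per-frame `Conv2D` stands in for the pretrained ResNet; that substitution, all layer sizes, and the final dense decoder that turns the GRU output back into a frame are assumptions made for illustration.

```python
import keras
from keras import layers

n_frames, rows, cols = 8, 32, 32  # assumed sizes

inputs = keras.Input(shape=(n_frames, rows, cols, 1))

# Stand-in for TimeDistributed(ResNet): the same small Conv2D applied to every frame.
x = layers.TimeDistributed(
    layers.Conv2D(8, (3, 3), padding="same", activation="relu")
)(inputs)

# Conv3D mixes neighbouring frames; pooling shrinks space only and keeps the time axis.
x = layers.Conv3D(16, (3, 3, 3), padding="same", activation="relu")(x)
x = layers.MaxPooling3D(pool_size=(1, 4, 4))(x)

# Flatten each time step into a vector so the recurrent layer sees a plain sequence.
features_per_step = (rows // 4) * (cols // 4) * 16
x = layers.Reshape((n_frames, features_per_step))(x)
x = layers.GRU(64)(x)

# Decode the sequence summary back into one greyscale frame.
x = layers.Dense(rows * cols, activation="sigmoid")(x)
outputs = layers.Reshape((rows, cols, 1))(x)

model = keras.Model(inputs, outputs)
```

The aggressive spatial pooling matters here: without it, the per-step vector handed to the GRU would be large enough to bring back the parameter problem of a plain recurrent layer on raw frames.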

PS:

One more thing worth mentioning is that the shape of video data is actually 4D: (frames, width, height, channels).
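With that 4D shape (and a channel axis of size 1 for greyscale), a Conv3D stack that takes N frames and returns one frame might look like the sketch below. The sizes and the trick of collapsing the time axis with a final kernel that spans all N frames are my assumptions; it is one possible layout, not the only one.

```python
import keras
from keras import layers

n_frames, rows, cols = 5, 32, 32  # assumed sizes

# One sample is 4D: (frames, rows, cols, channels); greyscale means channels=1.
inputs = keras.Input(shape=(n_frames, rows, cols, 1))

# "same" padding keeps the frame count and the spatial size unchanged.
x = layers.Conv3D(16, (3, 3, 3), padding="same", activation="relu")(inputs)
x = layers.Conv3D(16, (3, 3, 3), padding="same", activation="relu")(x)

# The kernel spans all n_frames, so "valid" padding reduces the time axis to length 1.
x = layers.Conv3D(1, (n_frames, 1, 1), padding="valid", activation="sigmoid")(x)

# Drop the singleton time axis: (1, rows, cols, 1) -> (rows, cols, 1).
outputs = layers.Reshape((rows, cols, 1))(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
```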

PS2:

If your data is actually 3D with shape (frames, width, height), you could use a classic Conv2D (by treating frames as channels) to analyse it, which might actually be more computationally efficient. For transfer learning you should add an additional dimension, because most CNN models were trained on data with shape (width, height, 3). Your data doesn't have 3 channels; the technique usually used in this case is to repeat the spatial matrix three times.
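Both reshaping tricks from this note can be written in a few lines of NumPy. The array sizes and the random data are placeholders for illustration.

```python
import numpy as np

n_frames, rows, cols = 5, 32, 32  # assumed sizes
video = np.random.rand(n_frames, rows, cols).astype("float32")

# Frames as channels: (frames, rows, cols) -> (rows, cols, frames),
# which is the channels-last layout a Conv2D layer expects for one sample.
frames_as_channels = np.moveaxis(video, 0, -1)

# Transfer learning: add a channel axis and copy the grey values three times,
# giving (frames, rows, cols, 3) so each frame looks like an RGB image.
video_rgb = np.repeat(video[..., np.newaxis], 3, axis=-1)

print(frames_as_channels.shape, video_rgb.shape)
```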

PS3:

An example of this 2.5D approach is:

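from keras.layers import Input, TimeDistributed, Conv3D, Flatten, Dense
from keras.applications import InceptionV3

# input_shape is (frames, width, height, 3); nb_of_filters and nb_of_classes are chosen by you.
inputs = Input(shape=input_shape)
base_cnn_model = InceptionV3(include_top=False)  # further arguments elided in the original
# Apply the same convnet to every frame, then convolve over (time, rows, cols) of the feature maps.
temporal_analysis = TimeDistributed(base_cnn_model)(inputs)
conv3d_analysis = Conv3D(nb_of_filters, (3, 3, 3))(temporal_analysis)
conv3d_analysis = Conv3D(nb_of_filters, (3, 3, 3))(conv3d_analysis)
output = Flatten()(conv3d_analysis)
output = Dense(nb_of_classes, activation="softmax")(output)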