What does keras.preprocessing.sequence.pad_sequences do?

Koffiman · Mar 22, 2017 · Viewed 29.9k times

The Keras documentation could be improved here. After reading through it, I still do not understand exactly what keras.preprocessing.sequence.pad_sequences does.

Could someone illuminate what this function does, and ideally provide an example?

Answer

oscfri · Mar 23, 2017

pad_sequences is used to ensure that all sequences in a list have the same length. By default, it does this by adding 0s to the beginning of each sequence until it is as long as the longest sequence.

For example:

>>> pad_sequences([[1, 2, 3], [3, 4, 5, 6], [7, 8]])
array([[0, 1, 2, 3],
       [3, 4, 5, 6],
       [0, 0, 7, 8]], dtype=int32)

[3, 4, 5, 6] is the longest sequence, so the other sequences are padded with 0s until their length matches it.

If you would rather pad at the end of the sequences, you can set padding='post'.
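
To make the difference between the two padding modes concrete, here is a minimal plain-Python sketch that mimics the behaviour on the example above. The helper name pad_to_longest is hypothetical and this is not the Keras implementation; it returns nested lists rather than a NumPy array.

```python
def pad_to_longest(sequences, padding="pre", value=0):
    """Hypothetical helper that mimics pad_sequences when maxlen is not set."""
    longest = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Number of fill values needed to reach the longest length.
        fill = [value] * (longest - len(seq))
        if padding == "post":
            padded.append(list(seq) + fill)   # fill goes at the end
        else:
            padded.append(fill + list(seq))   # fill goes at the beginning
    return padded


post_padded = pad_to_longest([[1, 2, 3], [3, 4, 5, 6], [7, 8]], padding="post")
print(post_padded)
# [[1, 2, 3, 0], [3, 4, 5, 6], [7, 8, 0, 0]]
```

With padding='post', the real pad_sequences should give the same rows as an int32 array.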

If you want every sequence to have a specific length, you can use the maxlen argument. Sequences longer than maxlen are truncated, and shorter ones are padded up to maxlen.

>>> pad_sequences([[1, 2, 3], [3, 4, 5, 6], [7, 8]], maxlen=3)
array([[1, 2, 3],
       [4, 5, 6],
       [0, 7, 8]], dtype=int32)

Now each sequence has length 3 instead.

According to the documentation, you can control the truncation with the truncating argument. By default, truncating is set to 'pre', which removes values from the beginning of the sequence. If you would rather remove values from the end of the sequence, you can set it to 'post'.
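
The effect of the truncating argument can be sketched in plain Python as follows. Again, pad_and_truncate is a hypothetical helper written only for illustration (it assumes maxlen is at least 1 and returns nested lists); it is not how Keras implements it.

```python
def pad_and_truncate(sequences, maxlen, padding="pre", truncating="pre", value=0):
    """Hypothetical helper that mimics pad_sequences with maxlen set (maxlen >= 1)."""
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            if truncating == "pre":
                seq = seq[-maxlen:]   # drop values from the beginning
            else:
                seq = seq[:maxlen]    # drop values from the end
        fill = [value] * (maxlen - len(seq))
        result.append(fill + seq if padding == "pre" else seq + fill)
    return result


data = [[1, 2, 3], [3, 4, 5, 6], [7, 8]]
pre_truncated = pad_and_truncate(data, maxlen=3)
post_truncated = pad_and_truncate(data, maxlen=3, truncating="post")
print(pre_truncated)   # [[1, 2, 3], [4, 5, 6], [0, 7, 8]]
print(post_truncated)  # [[1, 2, 3], [3, 4, 5], [0, 7, 8]]
```

Only the second row differs: with 'pre' the leading 3 is dropped, and with 'post' the trailing 6 is dropped.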