Meaning of buffer_size in Dataset.map , Dataset.prefetch and Dataset.shuffle

Ujjwal picture Ujjwal · Sep 27, 2017 · Viewed 43.3k times · Source

As per TensorFlow documentation , the prefetch and map methods of tf.contrib.data.Dataset class, both have a parameter called buffer_size.

For prefetch method, the parameter is known as buffer_size and according to documentation :

buffer_size: A tf.int64 scalar tf.Tensor, representing the maximum number elements that will be buffered when prefetching.

For the map method, the parameter is known as output_buffer_size and according to documentation :

output_buffer_size: (Optional.) A tf.int64 scalar tf.Tensor, representing the maximum number of processed elements that will be buffered.

Similarly for the shuffle method, the same quantity appears and according to documentation :

buffer_size: A tf.int64 scalar tf.Tensor, representing the number of elements from this dataset from which the new dataset will sample.

What is the relation between these parameters ?

Suppose I create aDataset object as follows :

 tr_data = TFRecordDataset(trainfilenames)
    tr_data = tr_data.map(providefortraining, output_buffer_size=10 * trainbatchsize, num_parallel_calls\
=5)
    tr_data = tr_data.shuffle(buffer_size= 100 * trainbatchsize)
    tr_data = tr_data.prefetch(buffer_size = 10 * trainbatchsize)
    tr_data = tr_data.batch(trainbatchsize)

What role is being played by the buffer parameters in the above snippet ?

Answer

mrry picture mrry · Oct 31, 2017

TL;DR Despite their similar names, these arguments have quite difference meanings. The buffer_size in Dataset.shuffle() can affect the randomness of your dataset, and hence the order in which elements are produced. The buffer_size in Dataset.prefetch() only affects the time it takes to produce the next element.


The buffer_size argument in tf.data.Dataset.prefetch() and the output_buffer_size argument in tf.contrib.data.Dataset.map() provide a way to tune the performance of your input pipeline: both arguments tell TensorFlow to create a buffer of at most buffer_size elements, and a background thread to fill that buffer in the background. (Note that we removed the output_buffer_size argument from Dataset.map() when it moved from tf.contrib.data to tf.data. New code should use Dataset.prefetch() after map() to get the same behavior.)

Adding a prefetch buffer can improve performance by overlapping the preprocessing of data with downstream computation. Typically it is most useful to add a small prefetch buffer (with perhaps just a single element) at the very end of the pipeline, but more complex pipelines can benefit from additional prefetching, especially when the time to produce a single element can vary.

By contrast, the buffer_size argument to tf.data.Dataset.shuffle() affects the randomness of the transformation. We designed the Dataset.shuffle() transformation (like the tf.train.shuffle_batch() function that it replaces) to handle datasets that are too large to fit in memory. Instead of shuffling the entire dataset, it maintains a buffer of buffer_size elements, and randomly selects the next element from that buffer (replacing it with the next input element, if one is available). Changing the value of buffer_size affects how uniform the shuffling is: if buffer_size is greater than the number of elements in the dataset, you get a uniform shuffle; if it is 1 then you get no shuffling at all. For very large datasets, a typical "good enough" approach is to randomly shard the data into multiple files once before training, then shuffle the filenames uniformly, and then use a smaller shuffle buffer. However, the appropriate choice will depend on the exact nature of your training job.