How to read data into TensorFlow batches from example queue?

JohnAllen · May 9, 2016 · Viewed 31.1k times

How do I get TensorFlow example queues into proper batches for training?

I've got some images and labels:

IMG_6642.JPG 1
IMG_6643.JPG 2

(feel free to suggest another label format; I think I may need another dense to sparse step...)
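On the label format: tf.decode_csv splits on commas by default, so a space-separated listing like the one above would need a different field_delim or a small conversion step. A minimal plain-Python sketch of that conversion is below; the function name parse_label_listing is made up for illustration, and it assumes each line ends with an integer label.

```python
def parse_label_listing(lines):
    """Split 'FILENAME LABEL' lines into parallel lists of names and int labels."""
    filenames, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last run of whitespace so filenames with spaces survive.
        name, label = line.rsplit(None, 1)
        filenames.append(name)
        labels.append(int(label))
    return filenames, labels


listing = ["IMG_6642.JPG 1", "IMG_6643.JPG 2"]
filenames, labels = parse_label_listing(listing)
print(filenames, labels)
```

The two lists could then be written back out comma-separated, or handed to whatever input pipeline is used later.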

I've read through quite a few tutorials but don't quite have it all together yet. Here's what I have, with comments indicating the steps required from TensorFlow's Reading Data page.

  1. The list of filenames (optional steps removed for the sake of simplicity)
  2. Filename queue
  3. A Reader for the file format
  4. A decoder for a record read by the reader
  5. Example queue

And after the example queue I need to get this queue into batches for training; that's where I'm stuck...

1. List of filenames

files = tf.train.match_filenames_once('*.JPG')

2. Filename queue

filename_queue = tf.train.string_input_producer(files, num_epochs=None, shuffle=True, seed=None, shared_name=None, name=None)

3. A reader

reader = tf.TextLineReader()
key, value = reader.read(filename_queue)

4. A decoder

record_defaults = [[""], [1]]
col1, col2 = tf.decode_csv(value, record_defaults=record_defaults)

(I don't think I need the step below because I already have my label in a tensor, but I include it anyway.)

features = tf.pack([col2])

The documentation page has an example that retrieves a single instance at a time, rather than getting the images and labels into batches:

for i in range(1200):
  # Retrieve a single instance:
  example, label = sess.run([features, col5])

And then below it has a batching section:

def read_my_file_format(filename_queue):
  reader = tf.SomeReader()
  key, record_string = reader.read(filename_queue)
  example, label = tf.some_decoder(record_string)
  processed_example = some_processing(example)
  return processed_example, label

def input_pipeline(filenames, batch_size, num_epochs=None):
  filename_queue = tf.train.string_input_producer(
      filenames, num_epochs=num_epochs, shuffle=True)
  example, label = read_my_file_format(filename_queue)
  # min_after_dequeue defines how big a buffer we will randomly sample
  #   from -- bigger means better shuffling but slower start up and more
  #   memory used.
  # capacity must be larger than min_after_dequeue and the amount larger
  #   determines the maximum we will prefetch.  Recommendation:
  #   min_after_dequeue + (num_threads + a small safety margin) * batch_size
  min_after_dequeue = 10000
  capacity = min_after_dequeue + 3 * batch_size
  example_batch, label_batch = tf.train.shuffle_batch(
      [example, label], batch_size=batch_size, capacity=capacity,
      min_after_dequeue=min_after_dequeue)
  return example_batch, label_batch
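To make the roles of min_after_dequeue and capacity in the comments above more concrete, here is a rough plain-Python sketch of a shuffle buffer. It is only an illustration of the idea and not how TensorFlow implements its queue; shuffle_batches, the seed argument, and the safety margin of 2 are assumptions made for the example.

```python
import random


def shuffle_batches(examples, batch_size, min_after_dequeue, seed=0):
    """Yield shuffled batches, sampling from a buffer kept above min_after_dequeue."""
    rng = random.Random(seed)
    source = iter(examples)
    buffer = []
    exhausted = False
    batch = []
    while True:
        # Refill until the buffer holds more than min_after_dequeue items,
        # so that one dequeue still leaves min_after_dequeue behind.
        while not exhausted and len(buffer) <= min_after_dequeue:
            try:
                buffer.append(next(source))
            except StopIteration:
                exhausted = True
        if not buffer:
            break
        # Dequeue one element chosen uniformly at random from the buffer.
        idx = rng.randrange(len(buffer))
        buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
        batch.append(buffer.pop())
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, smaller batch once the source runs dry


batch_size = 4
min_after_dequeue = 8
num_threads = 1  # assumed: a single enqueuing thread
# Recommendation from the comments: min_after_dequeue + (threads + margin) * batch_size
capacity = min_after_dequeue + (num_threads + 2) * batch_size

batches = list(shuffle_batches(range(20), batch_size, min_after_dequeue))
print(capacity, len(batches))
```

A larger min_after_dequeue means each draw is taken from a bigger pool, which is why it shuffles better but takes longer to fill before the first batch appears.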

My question is: how do I use the above example code with the code I have above? I need batches to work with, and most of the tutorials come with mnist batches already.

with tf.Session() as sess:
    sess.run(init)

    # Training cycle
    for epoch in range(training_epochs):
        total_batch = int(mnist.train.num_examples/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)

Answer

user5869947 · May 10, 2016

If you wish to make this input pipeline work, you will need to add an asynchronous queuing mechanism that generates batches of examples. This is done by creating a tf.RandomShuffleQueue or a tf.FIFOQueue and inserting JPEG images that have been read, decoded and preprocessed.

You can use handy constructs that will generate the queues and the corresponding threads for running them via tf.train.shuffle_batch_join or tf.train.batch_join. Here is a simplified example of what this would look like. Note that this code is untested:

# Let's assume there is a Queue that maintains a list of all filenames
# called 'filename_queue', and a reader for it called 'reader'.
_, file_buffer = reader.read(filename_queue)

# Decode the JPEG image. Batching needs every example to have the same
# fully defined shape, so resize or crop to a fixed size before this point.
image = tf.image.decode_jpeg(file_buffer)

# Generate batches of images of this size.
batch_size = 32

# Depends on the number of files and the training speed.
min_queue_examples = batch_size * 100
# shuffle_batch_join expects a list of tensor lists, one per reader thread.
images_batch = tf.train.shuffle_batch_join(
  [[image]],
  batch_size=batch_size,
  capacity=min_queue_examples + 3 * batch_size,
  min_after_dequeue=min_queue_examples)

# Run your network on this batch of images.
predictions = my_inference(images_batch)

Depending on how you need to scale up your job, you might need to run multiple independent threads that read/decode/preprocess images and dump them in your example queue. A complete example of such a pipeline is provided in the Inception/ImageNet model. Take a look at batch_inputs:

https://github.com/tensorflow/models/blob/master/inception/inception/image_processing.py#L407
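The idea of several independent threads that read, decode and preprocess images and feed one shared example queue can be illustrated without TensorFlow. The following is a minimal sketch using Python's standard queue and threading modules, not the Inception code itself; load_and_preprocess, run_pipeline and the generated filenames are hypothetical stand-ins.

```python
import queue
import threading


def load_and_preprocess(filename):
    """Hypothetical stand-in for reading, decoding and preprocessing one image."""
    return "decoded:" + filename


def producer(filenames, example_queue):
    """Preprocess each file and put the result on the shared example queue."""
    for name in filenames:
        example_queue.put(load_and_preprocess(name))


def run_pipeline(filenames, num_threads, batch_size):
    # Bounded queue: producers block when it is full, much like 'capacity'.
    example_queue = queue.Queue(maxsize=3 * batch_size)
    # Give each thread an interleaved slice of the file list.
    threads = [
        threading.Thread(target=producer,
                         args=(filenames[i::num_threads], example_queue))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    batches, batch = [], []
    for _ in range(len(filenames)):
        batch.append(example_queue.get())  # blocks until an example is ready
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    if batch:
        batches.append(batch)  # leftover examples form a smaller last batch
    for t in threads:
        t.join()
    return batches


files = ["IMG_%04d.JPG" % i for i in range(10)]
batches = run_pipeline(files, num_threads=3, batch_size=4)
print([len(b) for b in batches])
```

Because the threads run independently, the order of examples inside the batches can differ from run to run; only the set of examples is fixed.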

Finally, if you are working with more than about 1,000 JPEG images, keep in mind that it is extremely inefficient to read thousands of small files individually. This will slow down your training quite a bit.

A more robust and faster solution is to convert the dataset of images to a sharded TFRecord of Example protos. Here is a fully worked script for converting the ImageNet data set to such a format. And here is a set of instructions for running a generic version of this preprocessing script on an arbitrary directory containing JPEG images.
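As a rough illustration of what "sharded" means here, the sketch below only plans which image files would go into which shard; it does not write any TFRecord data. The helper shard_filenames and the name pattern prefix-NNNNN-of-NNNNN are assumptions for the example, not taken from the scripts mentioned above.

```python
def shard_filenames(filenames, num_shards, prefix="train"):
    """Assign files to num_shards contiguous shards; return {shard_name: files}."""
    shards = {}
    total = len(filenames)
    for shard in range(num_shards):
        # Integer arithmetic spreads any remainder evenly across the shards.
        start = shard * total // num_shards
        end = (shard + 1) * total // num_shards
        shard_name = "%s-%05d-of-%05d" % (prefix, shard, num_shards)
        shards[shard_name] = filenames[start:end]
    return shards


files = ["IMG_%04d.JPG" % i for i in range(10)]
plan = shard_filenames(files, num_shards=4)
for name in sorted(plan):
    print(name, len(plan[name]))
```

Each shard would then be written as one larger file, so training reads a handful of big files instead of thousands of small ones.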