I have the following error in my input pipeline:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot batch tensors with different shapes in component 0. First element had shape [2,48,48,3] and element 1 had shape [27,48,48,3].
with this code
dataset = tf.data.Dataset.from_generator(generator,
(tf.float32, tf.int64, tf.int64, tf.float32, tf.int64, tf.float32))
dataset = dataset.batch(max_buffer_size)
This is completely logical as the batch method tries to create a (batch_size, ?, 48, 48, 3) Tensor. However I want that it creates a [29,48,48,3] Tensor for this case. So concatenate instead of stack. Is this possible with tf.data?
I can do the concatenation in Python in the generator function, but I was wondering if this is also possible with the tf.data pipeline
In this case, the generator generates values of shape [None, 48, 48, 3]
where the first dimension could be anything. We want to batch this so that the output is [batch_size, 48, 48, 3]
. If we use directly tf.data.Dataset.batch
, we will have an error, so we need to unbatch first.
To do that we can use tf.contrib.data.unbatch
like this before batching:
dataset = dataset.apply(tf.contrib.data.unbatch())
dataset = dataset.batch(batch_size)
Here is a full example where the generator yields [1]
, [2, 2]
, [3, 3, 3]
and [4, 4, 4, 4]
.
We can't batch these output values directly, so we unbatch and then batch them:
def gen():
for i in range(1, 5):
yield [i] * i
# Create dataset from generator
# The output shape is variable: (None,)
dataset = tf.data.Dataset.from_generator(gen, tf.int64, tf.TensorShape([None]))
# The issue here is that we want to batch the data
dataset = dataset.apply(tf.contrib.data.unbatch())
dataset = dataset.batch(2)
# Create iterator from dataset
iterator = dataset.make_one_shot_iterator()
x = iterator.get_next() # shape (None,)
sess = tf.Session()
for i in range(5):
print(sess.run(x))
This will print the following output:
[1 2]
[2 3]
[3 3]
[4 4]
[4 4]
Update (03/30/2018): I removed the previous answer that used sharding which slows down performance by a lot (see comments).
In this case, we want to concatenate a fixed number of batches. The issue is that these batches have variable sizes. For instance the dataset yields [1]
and [2, 2]
and we want to get [1, 2, 2]
as the output.
Here a quick way to solve this is to create a new generator wrapping around the original one. The new generator will yield batched data. (Thanks to Guillaume for the idea)
Here is a full example where the generator yields [1]
, [2, 2]
, [3, 3, 3]
and [4, 4, 4, 4]
.
def gen():
for i in range(1, 5):
yield [i] * i
def get_batch_gen(gen, batch_size=2):
def batch_gen():
buff = []
for i, x in enumerate(gen()):
if i % batch_size == 0 and buff:
yield np.concatenate(buff, axis=0)
buff = []
buff += [x]
if buff:
yield np.concatenate(buff, axis=0)
return batch_gen
# Create dataset from generator
batch_size = 2
dataset = tf.data.Dataset.from_generator(get_batch_gen(gen, batch_size),
tf.int64, tf.TensorShape([None]))
# Create iterator from dataset
iterator = dataset.make_one_shot_iterator()
x = iterator.get_next() # shape (None,)
with tf.Session() as sess:
for i in range(2):
print(sess.run(x))
This will print the following output:
[1 2 2]
[3 3 3 4 4 4 4]