I am trying to build a simple model just to figure out how to deal with tf.data.Dataset.from_generator. I cannot understand how to set the output_shapes argument. I have tried several combinations, including not specifying it at all, but I still get errors due to tensor shape mismatches. The idea is just to yield two NumPy arrays with SIZE = 10 and run linear regression on them. Here is the code:
import numpy as np
import tensorflow as tf

SIZE = 10

def _generator():
    feats = np.random.normal(0, 1, SIZE)
    labels = np.random.normal(0, 1, SIZE)
    yield feats, labels

def input_func_gen():
    shapes = (SIZE, SIZE)
    dataset = tf.data.Dataset.from_generator(generator=_generator,
                                             output_types=(tf.float32, tf.float32),
                                             output_shapes=shapes)
    dataset = dataset.batch(10)
    dataset = dataset.repeat(20)
    iterator = dataset.make_one_shot_iterator()
    features_tensors, labels = iterator.get_next()
    features = {'x': features_tensors}
    return features, labels

def train():
    x_col = tf.feature_column.numeric_column(key='x')
    es = tf.estimator.LinearRegressor(feature_columns=[x_col])
    es = es.train(input_fn=input_func_gen)
Another question: is it possible to use this functionality to provide data for feature columns that are tf.feature_column.crossed_column? The overall goal is to use Dataset.from_generator for batch training where data is loaded in chunks from a database, for cases when the data does not fit in memory. All opinions and examples are highly appreciated.
Thanks!
The optional output_shapes argument of tf.data.Dataset.from_generator() allows you to specify the shapes of the values yielded from your generator. There are two constraints on its type that define how it should be specified:

1. The output_shapes argument is a "nested structure" (e.g. a tuple, a tuple of tuples, a dict of tuples, etc.) that must match the structure of the value(s) yielded by your generator. In your program, _generator() contains the statement yield feats, labels. Therefore the "nested structure" is a tuple of two elements (one for each array).

2. Each component of the output_shapes structure should match the shape of the corresponding tensor. The shape of a NumPy array is always a tuple of dimensions. (The shape of a tf.Tensor is more general: see this Stack Overflow question for a discussion.) Let's look at the actual shape of feats:
>>> import numpy as np
>>> SIZE = 10
>>> feats = np.random.normal(0, 1, SIZE)
>>> print(feats.shape)
(10,)
Therefore the output_shapes argument should be a 2-element tuple, where each element is (SIZE,):
shapes = ((SIZE,), (SIZE,))
dataset = tf.data.Dataset.from_generator(generator=_generator,
                                         output_types=(tf.float32, tf.float32),
                                         output_shapes=shapes)
Finally, you will need to provide a little more information about shapes to the tf.feature_column.numeric_column() and tf.estimator.LinearRegressor() APIs:
x_col = tf.feature_column.numeric_column(key='x', shape=(SIZE,))
es = tf.estimator.LinearRegressor(feature_columns=[x_col],
                                  label_dimension=10)
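On the follow-up goal of loading data in chunks from a database: the same pattern extends naturally, because the generator can fetch one chunk per iteration and yield rows lazily, so only one chunk is ever in memory. Here is a minimal sketch; fetch_chunk and the in-memory _fake_db are hypothetical stand-ins for your real query logic (e.g. a keyset or LIMIT/OFFSET query):

```python
import numpy as np

SIZE = 10
CHUNK_ROWS = 100

# Hypothetical stand-in for a database: 5 chunks of 100 rows each.
# In practice, fetch_chunk would execute a query and build these arrays.
_fake_db = [(np.random.normal(0, 1, (CHUNK_ROWS, SIZE)),
             np.random.normal(0, 1, CHUNK_ROWS)) for _ in range(5)]

def fetch_chunk(i):
    """Hypothetical helper: return (features, labels) for chunk i, or None."""
    if i >= len(_fake_db):
        return None
    return _fake_db[i]

def db_generator():
    """Yield one (features_dict, label) pair at a time, chunk by chunk."""
    i = 0
    while True:
        chunk = fetch_chunk(i)
        if chunk is None:
            return
        feats, labels = chunk
        for row, label in zip(feats, labels):
            yield {'x': row}, label
        i += 1

# Wiring this into tf.data would look like (not executed here):
# dataset = tf.data.Dataset.from_generator(
#     db_generator,
#     output_types=({'x': tf.float32}, tf.float32),
#     output_shapes=({'x': (SIZE,)}, ()))
# dataset = dataset.batch(32)
```

Note that output_shapes again mirrors the yielded structure: a dict of features paired with a scalar label, so the shapes are ({'x': (SIZE,)}, ()).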