I've been coding along with this example of a convolutional net in TensorFlow, and I'm mystified by this allocation of weights:
weights = {
# 5x5 conv, 1 input, 32 outputs
'wc1': tf.Variable(tf.random_normal([5, 5, 1, 32])),
# 5x5 conv, 32 inputs, 64 outputs
'wc2': tf.Variable(tf.random_normal([5, 5, 32, 64])),
# fully connected, 7*7*64 inputs, 1024 outputs
'wd1': tf.Variable(tf.random_normal([7*7*64, 1024])),
# 1024 inputs, 10 outputs (class prediction)
'out': tf.Variable(tf.random_normal([1024, n_classes]))
}
How do we know the 'wd1' weight matrix should have 7 x 7 x 64 rows?
It's later used to reshape the output of the second convolution layer:
# Fully connected layer
# Reshape conv2 output to fit dense layer input
dense1 = tf.reshape(conv2, [-1, _weights['wd1'].get_shape().as_list()[0]])
# Relu activation
dense1 = tf.nn.relu(tf.add(tf.matmul(dense1, _weights['wd1']), _biases['bd1']))
By my math, pooling layer 2 (conv2 output) has 4 x 4 x 64 neurons.
Why are we reshaping to [-1, 7*7*64]?
Working from the start:
The input, _X, is of size [28x28x1] (ignoring the batch dimension): a 28x28 greyscale image.
The first convolutional layer uses padding='SAME', so it outputs a 28x28 layer, which is then passed to a max_pool with k=2, reducing each spatial dimension by a factor of two and leaving a 14x14 layout. conv1 has 32 outputs -- so the full per-example tensor is now [14x14x32].
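You can check this directly by asking TensorFlow for the static shapes. A minimal sketch, assuming the same TF 1.x-style API the tutorial uses (the placeholder and variable names here are just for illustration):
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 28, 28, 1])                   # batch of 28x28x1 images
wc1 = tf.Variable(tf.random_normal([5, 5, 1, 32]))                  # 5x5 conv, 1 input, 32 outputs
conv1 = tf.nn.conv2d(x, wc1, strides=[1, 1, 1, 1], padding='SAME')  # SAME padding keeps 28x28
pool1 = tf.nn.max_pool(conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
print(pool1.get_shape())                                            # (?, 14, 14, 32)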
This is repeated in conv2, which has 64 outputs, resulting in a [7x7x64] output per example.
tl;dr: The image starts as 28x28, and each maxpool reduces it by a factor of two in each dimension. 28/2/2 = 7.
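Putting it all together, here is a self-contained shape trace (same caveat: this assumes the TF 1.x-style calls from the tutorial, and conv_pool is just a hypothetical helper for this sketch):
import tensorflow as tf

def conv_pool(inp, w):
    # stride-1 SAME conv keeps the spatial size; the 2x2 max-pool then halves it
    c = tf.nn.relu(tf.nn.conv2d(inp, w, strides=[1, 1, 1, 1], padding='SAME'))
    return tf.nn.max_pool(c, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

x = tf.placeholder(tf.float32, [None, 28, 28, 1])    # 28x28 greyscale input
wc1 = tf.Variable(tf.random_normal([5, 5, 1, 32]))
wc2 = tf.Variable(tf.random_normal([5, 5, 32, 64]))

conv1 = conv_pool(x, wc1)                            # (?, 14, 14, 32)
conv2 = conv_pool(conv1, wc2)                        # (?, 7, 7, 64)
flat = tf.reshape(conv2, [-1, 7 * 7 * 64])           # (?, 3136) -- the row count of 'wd1'
print(conv1.get_shape(), conv2.get_shape(), flat.get_shape())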