Fully-connected layer weight dimensions in TensorFlow ConvNet

jfbeltran · Jan 7, 2016 · Viewed 9.8k times

I've been coding along with this example of a convolutional net in TensorFlow, and I'm mystified by this allocation of weights:

weights = {
    # 5x5 conv, 1 input, 32 outputs
    'wc1': tf.Variable(tf.random_normal([5, 5, 1, 32])),
    # 5x5 conv, 32 inputs, 64 outputs
    'wc2': tf.Variable(tf.random_normal([5, 5, 32, 64])),
    # fully connected, 7*7*64 inputs, 1024 outputs
    'wd1': tf.Variable(tf.random_normal([7*7*64, 1024])),
    # 1024 inputs, 10 outputs (class prediction)
    'out': tf.Variable(tf.random_normal([1024, n_classes]))
}

How do we know the 'wd1' weight matrix should have 7 x 7 x 64 rows?

It's later used to reshape the output of the second convolution layer:

# Fully connected layer
# Reshape conv2 output to fit dense layer input
dense1 = tf.reshape(conv2, [-1, _weights['wd1'].get_shape().as_list()[0]]) 

# Relu activation
dense1 = tf.nn.relu(tf.add(tf.matmul(dense1, _weights['wd1']), _biases['bd1']))

By my math, pooling layer 2 (conv2 output) has 4 x 4 x 64 neurons.

Why are we reshaping to [-1, 7*7*64]?

Answer

dga · Jan 7, 2016

Working from the start:

The input, _X, is of size [28x28x1] (ignoring the batch dimension): a 28x28 greyscale image.

The first convolutional layer uses padding='SAME', so it outputs a 28x28 layer, which is then passed to a max_pool with k=2, reducing each dimension by a factor of two and giving a 14x14 spatial layout. conv1 has 32 outputs, so the full per-example tensor is now [14x14x32].

This is repeated in conv2, which has 64 outputs, resulting in a [7x7x64] per-example tensor.
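If you'd rather sanity-check the shapes than trust the arithmetic, here's a minimal sketch that pushes a dummy image through the two conv/pool stages and prints the result. It uses today's tf.nn names (tf.nn.max_pool2d) and all-zero placeholder tensors, which is my assumption rather than the original pre-1.0 code, but the shape bookkeeping is identical:

import tensorflow as tf

x = tf.zeros([1, 28, 28, 1])                                  # one 28x28 greyscale image
wc1 = tf.zeros([5, 5, 1, 32])                                 # same shape as weights['wc1']
wc2 = tf.zeros([5, 5, 32, 64])                                # same shape as weights['wc2']

h = tf.nn.conv2d(x, wc1, strides=1, padding='SAME')           # (1, 28, 28, 32)
h = tf.nn.max_pool2d(h, ksize=2, strides=2, padding='SAME')   # (1, 14, 14, 32)
h = tf.nn.conv2d(h, wc2, strides=1, padding='SAME')           # (1, 14, 14, 64)
h = tf.nn.max_pool2d(h, ksize=2, strides=2, padding='SAME')   # (1, 7, 7, 64)

print(h.shape)        # (1, 7, 7, 64)
print(7 * 7 * 64)     # 3136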

tl;dr: The image starts as 28x28, and each maxpool halves it in each dimension: 28/2/2 = 7. With conv2's 64 output channels, the flattened conv2 output therefore has 7*7*64 = 3136 features per example, which is exactly the number of rows 'wd1' needs.
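To tie that back to the reshape in your question: flattening a [batch, 7, 7, 64] tensor gives a [batch, 3136] matrix, and 'wd1' must have 3136 rows for the matmul to line up. A rough, self-contained sketch (the batch size of 8 and the zero tensors are just placeholders):

import tensorflow as tf

conv2_out = tf.zeros([8, 7, 7, 64])                   # stand-in for the pool2 output for a batch of 8
dense_in = tf.reshape(conv2_out, [-1, 7 * 7 * 64])    # (8, 3136)
wd1 = tf.zeros([7 * 7 * 64, 1024])                    # same shape as weights['wd1']
print(tf.matmul(dense_in, wd1).shape)                 # (8, 1024)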