TensorFlow loss not decreasing, and the gradients of the weights are close to zero

Mo.Explorer · Jun 1, 2017

Recently I have been doing research on face alignment (facial landmark detection) and want to do some further work based on the open-source Mnemonic Descent Method. Starting from that code, I made some modifications to the way the samples are imported, but there are some other problems that have confused me for quite a while.

First, the model is as follows:

  patches = _extract_patches_module.extract_patches(images, tf.constant(patch_shape), inits+dx)
  patches = tf.stop_gradient(patches)
  patches = tf.reshape(patches, (batch_size, num_patches * patch_shape[0], patch_shape[1], num_channels))
  endpoints['patches'] = patches

  with tf.variable_scope('convnet', reuse=step>0):
      net = conv_model(patches)
      ims = net['concat']

  ims = tf.reshape(ims, (batch_size, -1))

  with tf.variable_scope('rnn', reuse=step>0) as scope:
      hidden_state = slim.ops.fc(tf.concat(1, [ims, hidden_state]), 512, activation=tf.tanh)
      prediction = slim.ops.fc(hidden_state, num_patches * 2, scope='pred', activation=None)
      endpoints['prediction'] = prediction
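
For context, this block sits inside a loop over the recurrent refinement steps, which is why the variable scopes are reused for `step > 0` and why `dx` appears in the patch extraction. Roughly, the surrounding structure looks like this — a simplified sketch in which the inner computation is stubbed out with zeros and the shape/step numbers are only illustrative:

  import tensorflow as tf

  # Simplified sketch of the loop wrapping the block above. The inner
  # patch-extraction / convnet / rnn computation is replaced by a zero stub;
  # batch_size, num_patches and num_steps are example values only.
  batch_size, num_patches, num_steps = 60, 68, 4

  hidden_state = tf.zeros((batch_size, 512))
  dx = tf.zeros((batch_size, num_patches, 2))
  dxs = []

  for step in range(num_steps):
      # ... patch extraction + convnet + rnn block shown above ...
      prediction = tf.zeros((batch_size, num_patches * 2))  # stub for the fc output
      prediction = tf.reshape(prediction, (batch_size, num_patches, 2))
      dx += prediction   # accumulate the predicted landmark displacement
      dxs.append(dx)     # keep each step's estimate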

The conv_model is:

  with tf.op_scope([inputs], scope, 'mdm_conv'):
    with scopes.arg_scope([ops.conv2d, ops.fc], is_training=is_training):
      with scopes.arg_scope([ops.conv2d], activation=tf.nn.relu, padding='VALID'):
        net['conv_1'] = ops.conv2d(inputs, 32, [3, 3], scope='conv_1')
        net['pool_1'] = ops.max_pool(net['conv_1'], [2, 2])
        net['conv_2'] = ops.conv2d(net['pool_1'], 32, [3, 3], scope='conv_2')
        net['pool_2'] = ops.max_pool(net['conv_2'], [2, 2])

        crop_size = net['pool_2'].get_shape().as_list()[1:3]
        net['conv_2_cropped'] = utils.get_central_crop(net['conv_2'], box=crop_size)
        net['concat'] = tf.concat(3, [net['conv_2_cropped'], net['pool_2']])
        return net

The initial learning rate is set to 1e-3 and the batch size is 60.
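
For completeness, the training op with these hyperparameters is built along these lines. This is a minimal, self-contained sketch with a dummy loss; the choice of Adam and the variable names are illustrative rather than an exact copy of the MDM training script:

  import tensorflow as tf

  learning_rate = 1e-3

  # Dummy variable and loss so the snippet runs on its own; in the real code
  # the loss is the normalized RMSE defined further below.
  w = tf.Variable(tf.random_normal([10]))
  total_loss = tf.reduce_mean(tf.square(w))

  optimizer = tf.train.AdamOptimizer(learning_rate)
  train_op = optimizer.minimize(total_loss)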

The first problem is that during training the loss stays almost unchanged, i.e. it does not decrease overall, even after more than 10,000 iterations. It looks like this:

  2017-06-01 19:46:01.120850: step 3060, loss = 0.8852 (15.7 examples/sec; 3.830 sec/batch)
  2017-06-01 19:46:37.776494: step 3070, loss = 0.7375 (18.2 examples/sec; 3.291 sec/batch)
  2017-06-01 19:47:09.242257: step 3080, loss = 0.8160 (16.5 examples/sec; 3.635 sec/batch)
  2017-06-01 19:47:46.441860: step 3090, loss = 0.7973 (17.1 examples/sec; 3.501 sec/batch)
  2017-06-01 19:48:19.793012: step 3100, loss = 0.7228 (18.2 examples/sec; 3.292 sec/batch)
  2017-06-01 19:48:56.614480: step 3110, loss = 0.8687 (21.8 examples/sec; 2.750 sec/batch)
  2017-06-01 19:49:29.904451: step 3120, loss = 0.8662 (19.8 examples/sec; 3.024 sec/batch)
  2017-06-01 19:50:06.186441: step 3130, loss = 0.7927 (22.7 examples/sec; 2.648 sec/batch)
  2017-06-01 19:50:40.794964: step 3140, loss = 0.7585 (16.2 examples/sec; 3.711 sec/batch)
  2017-06-01 19:51:18.612637: step 3150, loss = 0.8264 (17.9 examples/sec; 3.348 sec/batch)
  2017-06-01 19:51:52.905742: step 3160, loss = 0.7504 (17.2 examples/sec; 3.498 sec/batch)
  2017-06-01 19:52:29.895365: step 3170, loss = 0.7569 (16.6 examples/sec; 3.615 sec/batch)
  2017-06-01 19:53:03.509374: step 3180, loss = 0.6869 (16.3 examples/sec; 3.692 sec/batch)
  2017-06-01 19:53:40.798535: step 3190, loss = 0.7592 (18.9 examples/sec; 3.180 sec/batch)
  2017-06-01 19:54:14.063566: step 3200, loss = 0.7689 (19.1 examples/sec; 3.136 sec/batch)
  2017-06-01 19:54:50.741630: step 3210, loss = 0.7345 (19.7 examples/sec; 3.040 sec/batch)

The loss function is:

def normalized_rmse(pred, gt_truth):
    norm = tf.sqrt(1e-12 + tf.reduce_sum(((gt_truth[:, 36, :] - gt_truth[:, 45, :])**2), 1))

    return tf.reduce_sum(tf.sqrt(1e-12 + tf.reduce_sum(tf.square(pred - gt_truth), 2)), 1) / (norm * 68)
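
As a sanity check that the loss itself behaves as expected, here is a NumPy re-implementation on dummy data (the function name `normalized_rmse_np` and the random inputs are just for this check; landmarks 36 and 45 are the outer eye corners in the 68-point annotation, so `norm` is the inter-ocular distance):

  import numpy as np

  def normalized_rmse_np(pred, gt_truth):
      # pred, gt_truth: (batch, 68, 2) landmark coordinates
      norm = np.sqrt(1e-12 + np.sum((gt_truth[:, 36, :] - gt_truth[:, 45, :]) ** 2, axis=1))
      per_point = np.sqrt(1e-12 + np.sum((pred - gt_truth) ** 2, axis=2))   # (batch, 68)
      return np.sum(per_point, axis=1) / (norm * 68)

  gt = np.random.rand(4, 68, 2).astype(np.float32)
  print(normalized_rmse_np(gt, gt))           # ~0 when the prediction is perfect
  print(normalized_rmse_np(gt + 0.05, gt))    # a constant offset gives a clearly nonzero error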

In fact, there are more than 3,000 images in the training dataset. After augmentation, the images derived from the same source image differ from one another in several ways, so the samples in each batch are different.

However, when I train the model with only one image, it converges after about 1,000 steps, i.e. the loss decreases noticeably and approaches zero, which really confuses me now...

Then I used TensorBoard to visualize the results, which are shown below:

[image: the loss variation]

and

[image: the gradients of the weights and biases]

The results also show that the loss does not decrease overall. At the same time, they reveal the second problem: the gradients of the biases in the convolutional model change noticeably, but the gradients of the weights stay almost unchanged! Even when the model is trained with only one image, which does converge after about 1,000 steps, the corresponding weight gradients in the convolutional model still stay unchanged...
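
For reference, the gradient histograms in the plots above come from summaries built roughly like this (a sketch using the pre-1.0 summary API that matches the rest of the code; `opt` and `total_loss` are placeholder names for my optimizer and scalar loss):

  # Sketch of the per-variable gradient summaries behind the plots above.
  # `opt` and `total_loss` are placeholders for the optimizer and scalar loss.
  grads_and_vars = opt.compute_gradients(total_loss)

  summaries = []
  for grad, var in grads_and_vars:
      if grad is not None:
          summaries.append(tf.histogram_summary(var.op.name + '/gradients', grad))

  summary_op = tf.merge_summary(summaries)
  # written out periodically in the training loop:
  # summary_writer.add_summary(sess.run(summary_op, feed_dict), global_step)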

I am a newcomer to TensorFlow, and I have tried my best to solve these problems but failed in the end... So I sincerely hope someone can help me out. Thank you very much!

Answer