DQN - Q-Loss not converging

user8861893 · Oct 31, 2017 · Viewed 10.9k times

I'm using the DQN algorithm to train an agent in my environment, which looks like this:

  • Agent is controlling a car by picking discrete actions (left, right, up, down)
  • The goal is to drive at a desired speed without crashing into other cars
  • The state contains the velocities and positions of the agent's car and the surrounding cars
  • Rewards: -100 for crashing into other cars, positive reward according to the absolute difference to the desired speed (+50 if driving at the desired speed); see the sketch below
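
A minimal sketch of this reward shaping (the linear ramp and the names `crashed`, `speed`, `desired_speed`, and `max_speed_error` are illustrative, not my exact implementation):

    def reward(crashed: bool, speed: float, desired_speed: float,
               max_speed_error: float = 10.0) -> float:
        """Illustrative reward: -100 on crash, up to +50 for matching the desired speed."""
        if crashed:
            return -100.0
        # Assumed shaping: the bonus shrinks linearly with the absolute speed error
        speed_error = abs(speed - desired_speed)
        return 50.0 * max(0.0, 1.0 - speed_error / max_speed_error)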

I have already tuned some hyperparameters (network architecture, exploration, learning rate), which gave me decent results, but still not as good as they should/could be. The rewards per episode are increasing during training. The Q-values are converging, too (see figure 1). However, for all hyperparameter settings the Q-loss does not converge (see figure 2). I assume that the lack of convergence of the Q-loss might be the limiting factor for better results.

Figure 1: Q-value of one discrete action during training

Figure 2: Q-loss during training

I'm using a target network which is updated every 20k timesteps. The Q-loss is calculated as MSE.
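
Roughly, the loss is computed like this (a sketch of my setup; `model`, `target_model`, and the replay-batch tensors are stand-ins, not my exact code):

    import torch
    import torch.nn.functional as F

    def q_loss(model, target_model, batch, gamma=0.99):
        states, actions, rewards, next_states, dones = batch
        # Q(s, a) of the actions that were actually taken
        q_values = model(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            # Bootstrapped targets from the frozen target network
            next_q = target_model(next_states).max(dim=1).values
            targets = rewards + gamma * (1.0 - dones) * next_q
        # MSE between current Q-values and the targets
        return F.mse_loss(q_values, targets)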

Do you have any ideas why the Q-loss is not converging? Does the Q-loss have to converge for the DQN algorithm to work? I'm wondering why the Q-loss is not discussed in most of the papers.

Answer

Alexander · Nov 7, 2019

Yes, the loss should converge, because the loss value represents the difference between the expected Q-value (the target) and the current Q-value. Only when the loss converges does the current estimate approach the optimal Q-value. If it diverges, your approximation is becoming less and less accurate.
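
Concretely, the Q-loss in question is the expected squared TD error between the bootstrapped target and the current estimate:

$$L(\theta) = \mathbb{E}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\big)^2\Big]$$

where $\theta^{-}$ are the parameters of the (periodically updated) target network.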

Maybe you can try adjusting the update frequency of the target network, or check the gradient of each update (add gradient clipping). The addition of the target network increases the stability of Q-learning.

DeepMind's 2015 Nature paper states that:

The second modification to online Q-learning aimed at further improving the stability of our method with neural networks is to use a separate network for generating the targets y_j in the Q-learning update. More precisely, every C updates we clone the network Q to obtain a target network Q' and use Q' for generating the Q-learning targets y_j for the following C updates to Q. This modification makes the algorithm more stable compared to standard online Q-learning, where an update that increases Q(s_t, a_t) often also increases Q(s_{t+1}, a) for all a and hence also increases the target y_j, possibly leading to oscillations or divergence of the policy. Generating the targets using the older set of parameters adds a delay between the time an update to Q is made and the time the update affects the targets y_j, making divergence or oscillations much more unlikely.

Human-level control through deep reinforcement learning, Mnih et al., 2015
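
A minimal sketch of this schedule in PyTorch (here `model`, `optimizer`, `replay_batches`, and `compute_q_loss` are placeholders for your own training code):

    import copy

    C = 100  # target-network update frequency; tune this
    target_model = copy.deepcopy(model)

    for step, batch in enumerate(replay_batches):
        loss = compute_q_loss(model, target_model, batch)  # e.g. the MSE Q-loss from the question
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Every C gradient updates, clone the online network into the target network
        if step % C == 0:
            target_model.load_state_dict(model.state_dict())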

I ran an experiment for another person who asked a similar question in the CartPole environment, and an update frequency of 100 solves the problem (the agent achieves the maximum of 200 steps).

Plots of the average loss for different update frequencies C:

  • C = 2: [average-loss plot]
  • C = 10: [average-loss plot]
  • C = 100: [average-loss plot]
  • C = 1000: [average-loss plot]
  • C = 10000: [average-loss plot]

If the divergence of the loss is caused by exploding gradients, you can clip the gradient. In DeepMind's 2015 DQN, the authors clipped the gradient by limiting each value to within [-1, 1]. In the other case, the authors of Prioritized Experience Replay clipped the gradient by limiting its norm to 10. Here are the examples:

DQN gradient clipping:

    # Backpropagate, then clamp each gradient element to [-1, 1] before stepping
    optimizer.zero_grad()
    loss.backward()
    for param in model.parameters():
        param.grad.data.clamp_(-1, 1)
    optimizer.step()

PER gradient clipping:

    # Backpropagate, then rescale gradients so their global L2 norm is at most 10
    optimizer.zero_grad()
    loss.backward()
    if self.grad_norm_clipping:
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), 10)
    optimizer.step()