I'm using the DQN algorithm to train an agent in my environment, which looks like this:
I have already adjusted some hyperparameters (network architecture, exploration, learning rate), which gave me decent results, but still not as good as they should/could be. The reward per episode increases during training, and the Q-values are converging as well (see figure 1). However, for all hyperparameter settings I tried, the Q-loss does not converge (see figure 2). I assume that this lack of convergence of the Q-loss might be the limiting factor for better results.
Figure 1: Q-value of one discrete action during training
I'm using a target network which is updated every 20k timesteps. The Q-loss is calculated as MSE.
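In case it helps, the loss is computed roughly like this (a simplified sketch of the idea rather than my actual code; q_net, target_net, gamma and the batch tensors are placeholder names):

import torch
import torch.nn.functional as F

def compute_q_loss(q_net, target_net, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch
    # Q-values of the actions that were actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # bootstrapped targets from the (periodically updated) target network
    with torch.no_grad():
        next_q = target_net(next_states).max(1)[0]
        targets = rewards + gamma * (1 - dones) * next_q
    # MSE between online Q-values and targets
    return F.mse_loss(q_values, targets)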
Do you have any ideas why the Q-loss is not converging? Does the Q-loss have to converge for the DQN algorithm to work well? I'm wondering why the Q-loss is not discussed in most papers.
Yes, the loss should converge, because the loss value is the difference between the expected Q-value and the current Q-value. Only when the loss converges does the current estimate approach the optimal Q-value. If it diverges, it means your approximation is becoming less and less accurate.
Maybe you can try adjusting the update frequency of the target network, or check the gradient of each update (and add gradient clipping; see the examples at the end). Adding a target network increases the stability of Q-learning.
DeepMind's 2015 Nature paper states:
The second modification to online Q-learning aimed at further improving the stability of our method with neural networks is to use a separate network for generating the target yj in the Q-learning update. More precisely, every C updates we clone the network Q to obtain a target network Q' and use Q' for generating the Q-learning targets yj for the following C updates to Q. This modification makes the algorithm more stable compared to standard online Q-learning, where an update that increases Q(st,at) often also increases Q(st+1, a) for all a and hence also increases the target yj, possibly leading to oscillations or divergence of the policy. Generating the targets using the older set of parameters adds a delay between the time an update to Q is made and the time the update affects the targets yj, making divergence or oscillations much more unlikely.
Human-level control through deep reinforcement learning, Mnih et al., 2015
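In code, that cloning step can be as simple as copying the online network's parameters into a second network every C gradient updates. Below is a minimal sketch (not the paper's implementation); q_net, td_loss, replay_batches, optimizer and C are assumed names:

import copy

target_net = copy.deepcopy(q_net)              # clone Q to obtain the target network Q'
for step, batch in enumerate(replay_batches, start=1):
    loss = td_loss(q_net, target_net, batch)   # targets yj are generated with Q'
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % C == 0:
        # every C updates, refresh Q' with the current parameters of Q
        target_net.load_state_dict(q_net.state_dict())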
I ran an experiment for another person who asked a similar question, using the CartPole environment, and a target-network update frequency of 100 solved the problem (the agent achieved the maximum of 200 steps).
Plots of the average loss for different target-network update frequencies C:
C = 2
C = 10
C = 100
C = 1000
C = 10000
If the divergence of the loss is caused by exploding gradients, you can clip the gradients. In DeepMind's 2015 DQN, the authors clipped the gradient by limiting each component to [-1, 1]. In Prioritized Experience Replay, the authors clipped the gradient by limiting its norm to 10. Here are the examples:
DQN gradient clipping:
optimizer.zero_grad()
loss.backward()
# clamp every gradient component to the range [-1, 1]
for param in model.parameters():
    param.grad.data.clamp_(-1, 1)
optimizer.step()
PER gradient clipping:
optimizer.zero_grad()
loss.backward()
# rescale the whole gradient so that its norm does not exceed 10
if self.grad_norm_clipping:
    torch.nn.utils.clip_grad_norm_(self.model.parameters(), 10)
optimizer.step()