What difference does a big or small gamma value make to the algorithm? As I see it, as long as it is neither 0 nor 1, it should work exactly the same. On the other hand, whatever gamma I choose, the Q-values seem to get pretty close to zero really quickly (I'm seeing values on the order of 10^-300 in just a quick test). How do people usually plot Q-values (I'm plotting (x, y, best Q-value for that state)) given that problem? I'm trying to get around it with logarithms, but even then it feels kind of awkward.
Also, I don't get the reason behind having an alpha parameter in the Q-learning update function. It basically sets the magnitude of the update we are going to make to the Q-value function. I have the idea that it is usually decreased over time. What is the point of decreasing it over time? Should an update in the beginning have more importance than one 1000 episodes later?
Also, I was thinking that a good idea for exploring the state space every time the agent doesn't want to do the greedy action would be to explore any state that still has a zero Q-value (which means, at least most of the time, a state never visited before), but I don't see that mentioned in any literature. Are there any downsides to this? I know this can't be used with (at least some) generalization functions.
Another idea would be to keep a table of visited states/actions, and try the actions that have been tried the fewest times in that state. Of course this can only be done in relatively small state spaces (in my case it is definitely possible).
A third idea for late in the exploration process would be to look not only at the Q-values of the actions available in the current state, but also at the actions available in the states those lead to, then in the states after those, and so on.
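To make the visit-count idea concrete, here is roughly what I have in mind. It is only a sketch: the names (visit_counts, least_tried_action, choose_action), the dict-based Q-table and the epsilon value are all made up for illustration, and it assumes a state space small enough to keep a table.

```python
import random
from collections import defaultdict

# Hypothetical table: (state, action) -> number of times that pair was tried.
visit_counts = defaultdict(int)

def least_tried_action(state, actions, rng=random):
    """Pick one of the actions tried the fewest times in this state."""
    fewest = min(visit_counts[(state, a)] for a in actions)
    candidates = [a for a in actions if visit_counts[(state, a)] == fewest]
    return rng.choice(candidates)  # break ties at random

def choose_action(state, actions, q_values, epsilon=0.1, rng=random):
    """Greedy most of the time; otherwise explore the least-tried action."""
    if rng.random() < epsilon:
        action = least_tried_action(state, actions, rng)
    else:
        # q_values is assumed to be a dict keyed by (state, action).
        action = max(actions, key=lambda a: q_values.get((state, a), 0.0))
    visit_counts[(state, action)] += 1
    return action
```

The only difference from plain epsilon-greedy is that the exploratory branch is not uniformly random: it prefers whatever has been tried least in the current state.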
I know those questions are kinda unrelated but I'd like to hear the opinions of people that have worked before with this and (probably) struggled with some of them too.
From a reinforcement learning master's candidate:
Alpha is the learning rate. If the reward or transition function is stochastic (random), then alpha should change over time, approaching zero in the limit. This has to do with approximating the expected outcome of an inner product (T(transition)*R(reward)) when one of the two, or both, have random behavior.
That fact is important to note.
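For reference, this is where alpha and gamma enter the usual tabular Q-learning update. Treat it as a minimal sketch: the function name q_update, the dict-based table and the example states and actions are my own placeholders, not anything from your code.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """One tabular Q-learning step; q maps (state, action) -> value."""
    # Best value the agent currently believes it can get from next_state.
    best_next = max(q[(next_state, a)] for a in actions)
    # gamma scales how much that future value counts toward the target.
    target = reward + gamma * best_next
    # alpha scales how far the old estimate moves toward the target.
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]

q = defaultdict(float)  # unseen pairs start at 0.0
q_update(q, "s0", "right", 1.0, "s1", ["left", "right"], alpha=0.5, gamma=0.9)
```

With alpha at 1 the old estimate is thrown away on every step; with alpha near 0 it barely moves.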
Gamma is the value of future reward. It can affect learning quite a bit, and can be a dynamic or static value. If it is equal to one, the agent values future reward JUST AS MUCH as current reward. This means that if the agent does something good ten actions from now, that is JUST AS VALUABLE as doing it right away. So learning doesn't work that well when gamma is at or very near one, since the agent gets little pressure to collect reward sooner rather than later.
Conversely, a gamma of zero will cause the agent to only value immediate rewards, which only works with very detailed reward functions.
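To get a feel for the two extremes, you can look at how much a reward of 1 is worth when it arrives some number of steps in the future. A small sketch (the function name discounted_value and the gamma values tried are arbitrary choices of mine):

```python
def discounted_value(reward, steps_ahead, gamma):
    """Present value of a reward received steps_ahead steps from now."""
    return (gamma ** steps_ahead) * reward

# How much is a reward of 1.0, ten steps away, worth today?
for gamma in (0.0, 0.5, 0.9, 1.0):
    print(gamma, discounted_value(1.0, 10, gamma))
```

At gamma = 0 the distant reward is worth nothing, at gamma = 1 it is worth exactly as much as an immediate one, and everything in between shrinks it geometrically with distance.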
Also - as for exploration behavior... there are actually TONS of papers on this. All of your ideas have, 100%, been tried. I would recommend a more detailed search, and even googling Decision Theory and "Policy Improvement".
Just adding a note on alpha: Imagine you have a reward function that spits out 1 or 0 for a certain state-action combo SA. Every time you execute SA, you get 1 or 0. If you keep alpha at 1, your Q-value will always be exactly the last reward, 1 or 0. If it's 0.5, the Q-value moves halfway toward each new reward, so it keeps oscillating forever and never settles (with rewards arriving 1,0,1,0,... it ends up bouncing between about 0.67 and 0.33). However, if you shrink alpha every time you execute SA, say alpha = 1/n on the n-th execution, you get values like this (assuming rewards are received 1,0,1,0,...): 1, 0.5, 0.67, 0.5, 0.6, 0.5, .... The swings get smaller and smaller, and at infinity the value will be 0.5, which is the expected reward in a probabilistic sense. (Literally halving alpha on every step shrinks it too fast: the estimate freezes before it can converge to 0.5.)
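If you want to see this for yourself, it is easy to simulate. This is just a sketch under my own assumptions: the update is Q <- Q + alpha * (r - Q) with no next-state term, Q starts at 0, the rewards alternate 1,0,1,0,..., and the decaying schedule is alpha = 1/n. The names run_updates and alpha_schedule are made up.

```python
def run_updates(rewards, alpha_schedule):
    """Apply Q <- Q + alpha * (r - Q) for each reward; return every estimate."""
    q = 0.0
    history = []
    for n, r in enumerate(rewards, start=1):
        alpha = alpha_schedule(n)  # learning rate for the n-th execution
        q += alpha * (r - q)
        history.append(q)
    return history

rewards = [1, 0] * 500  # alternating rewards, expected value 0.5

constant = run_updates(rewards, lambda n: 0.5)      # fixed alpha
decaying = run_updates(rewards, lambda n: 1.0 / n)  # alpha = 1/n

print(constant[-2:])  # last two estimates with fixed alpha
print(decaying[-1])   # last estimate with decaying alpha
```

With the fixed alpha the estimate keeps bouncing between roughly 2/3 and 1/3, while the 1/n schedule (which is just a running average of the rewards) settles at 0.5.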