Q-learning vs dynamic programming

D_Wills · Aug 17, 2016 · Viewed 10.4k times

Is the classic Q-learning algorithm, using a lookup table (instead of function approximation), equivalent to dynamic programming?

Answer

Pablo EM · Aug 17, 2016

From Sutton & Barto's book (Reinforcement Learning: An Introduction, chapter 4):

The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP). Classical DP algorithms are of limited utility in reinforcement learning both because of their assumption of a perfect model and because of their great computational expense, but they are still important theoretically.

So, although both share the same working principles (whether in their tabular forms or with function approximation), the key difference between classic DP and classic RL is that the former assumes the model is known. This basically means knowing the transition probabilities (which indicate the probability of moving from state s to state s' given action a) and the expected immediate reward function.
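
To make this concrete, here is a minimal sketch of what "knowing the model" lets DP do. The toy 2-state MDP, its numbers, and all variable names are invented for illustration; the point is that the Bellman backup takes an expectation over transition probabilities that are given to the algorithm up front.

```python
import numpy as np

# Toy 2-state, 2-action MDP; this dict is the "perfect model" that classic DP
# assumes: P[s][a] -> list of (probability, next_state, reward). All values
# are made up for illustration.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
n_states, n_actions, gamma = 2, 2, 0.9

# Q-value iteration: sweep Bellman optimality backups over every (s, a) pair,
# taking an expectation over the KNOWN transition probabilities.
Q_dp = np.zeros((n_states, n_actions))
for _ in range(200):
    for s in range(n_states):
        for a in range(n_actions):
            Q_dp[s, a] = sum(p * (r + gamma * Q_dp[s2].max())
                             for p, s2, r in P[s][a])

print(Q_dp)  # optimal tabular Q-values, computed without collecting a single sample
```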

In contrast, RL methods only require access to a set of samples, collected either online or offline (depending on the algorithm).
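
For comparison, here is a sketch of tabular Q-learning on the same toy MDP. The learner only ever sees sampled transitions (s, a, r, s') through the step() function; it never reads the probabilities directly. Again, all names and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy MDP as in the DP sketch, but hidden behind a step() function that
# only returns sampled transitions to the agent.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

def step(s, a):
    outcomes = P[s][a]
    i = rng.choice(len(outcomes), p=[p for p, _, _ in outcomes])
    _, s2, r = outcomes[i]
    return s2, r

n_states, n_actions = 2, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.1

Q = np.zeros((n_states, n_actions))
s = 0
for _ in range(50000):
    # epsilon-greedy behaviour policy
    a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
    s2, r = step(s, a)
    # Q-learning update: the expectation over next states is replaced by the
    # single observed sample (s, a, r, s2)
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
    s = s2

print(Q)  # approaches the DP solution as more samples are collected
```

With a lookup table, both sketches converge toward the same optimal Q-values; the difference is whether the expectation in the backup is computed from the known model or estimated from samples.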

Of course, there are hybrid methods that sit between RL and DP, for example those that learn a model from the samples and then use that model in the learning process.
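
As a rough, hypothetical sketch of that hybrid idea (in the spirit of Dyna / certainty-equivalent planning): estimate transition and reward models from observed samples, then run DP-style backups on the estimated model. The record()/plan() helpers below are invented names for illustration.

```python
import numpy as np
from collections import defaultdict

trans_counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
reward_sums = defaultdict(float)                        # (s, a) -> sum of rewards
visit_counts = defaultdict(int)                         # (s, a) -> number of visits

def record(s, a, r, s2):
    """Update the empirical model with one observed transition."""
    trans_counts[(s, a)][s2] += 1
    reward_sums[(s, a)] += r
    visit_counts[(s, a)] += 1

def plan(n_states, n_actions, gamma=0.9, sweeps=200):
    """Q-value iteration on the *estimated* model built by record()."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        for (s, a), n in visit_counts.items():
            r_hat = reward_sums[(s, a)] / n
            Q[s, a] = r_hat + gamma * sum(
                (c / n) * Q[s2].max() for s2, c in trans_counts[(s, a)].items()
            )
    return Q
```

Here record() would be fed the same (s, a, r, s') samples the Q-learning sketch consumes, and plan() would replace the direct sample-based update.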

NOTE: The term dynamic programming, in addition to denoting a set of mathematical optimization techniques related to RL, is also used to refer to a "general algorithmic pattern", as pointed out in the comments. In both cases the fundamentals are the same, but the term may have a different meaning depending on the context.