What is the difference between value iteration and policy iteration?

Arslán · May 22, 2016

In reinforcement learning, what is the difference between policy iteration and value iteration?

As far as I understand, in value iteration you use the Bellman equation to solve for the optimal policy, whereas in policy iteration you randomly select a policy π and find the reward of that policy.

My question is: if we select a random policy π in policy iteration, how is it guaranteed to converge to the optimal policy, even if we try several random policies?

Answer

zyxue · Feb 27, 2017

Let's look at them side by side. The key parts for comparison are highlighted. Figures are from Sutton and Barto's book: Reinforcement Learning: An Introduction.

[Figure: pseudocode for policy iteration and value iteration from Sutton and Barto, shown side by side with the differing steps highlighted]

Key points:

  1. Policy iteration includes: policy evaluation + policy improvement, and the two are repeated iteratively until the policy converges.
  2. Value iteration includes: finding the optimal value function + one policy extraction. The two are not repeated, because once the value function is optimal, the policy extracted from it should also be optimal (i.e. converged).
  3. Finding the optimal value function can also be seen as a combination of policy improvement (due to the max) and truncated policy evaluation (the reassignment of v(s) after just one sweep of all states, regardless of convergence).
  4. The algorithms for policy evaluation and finding the optimal value function are highly similar except for a max operation (as highlighted; see also the sketch after this list).
  5. Similarly, the key steps of policy improvement and policy extraction are identical, except that the former involves a stability check.
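To make the structural difference concrete, here is a minimal Python sketch of both algorithms on a tabular MDP. This is my own illustration, not the book's pseudocode: the array names P (transition probabilities P[s, a, s']), R (expected rewards R[s, a]), and the toy two-state MDP at the end are all assumptions made for the example.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, tol=1e-8):
    """Repeated Bellman expectation backups until V converges (no max)."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            a = policy[s]
            v = R[s, a] + gamma * P[s, a] @ V  # expectation backup for the fixed policy
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

def policy_iteration(P, R, gamma=0.9):
    """Alternate full policy evaluation and greedy policy improvement."""
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)      # arbitrary initial policy
    while True:
        V = policy_evaluation(P, R, policy, gamma)
        Q = R + gamma * P @ V                   # one-step lookahead values Q[s, a]
        new_policy = Q.argmax(axis=1)           # greedy improvement
        if np.array_equal(new_policy, policy):  # stability check: policy converged
            return policy, V
        policy = new_policy

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Bellman optimality backups (note the max), then one policy extraction."""
    V = np.zeros(P.shape[0])
    while True:
        V_new = (R + gamma * P @ V).max(axis=1)  # the only difference: max over actions
        if np.abs(V_new - V).max() < tol:
            break
        V = V_new
    policy = (R + gamma * P @ V).argmax(axis=1)  # extract the policy once, at the end
    return policy, V

# Toy 2-state, 2-action MDP (made up for illustration); both methods
# should return the same optimal policy.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])  # P[s, a, s']
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])                # R[s, a]
print(policy_iteration(P, R))
print(value_iteration(P, R))
```

Points 4 and 5 show up directly in the sketch: the inner backup differs only by the max over actions, policy iteration's outer loop ends with a stability check, and value iteration extracts the policy exactly once after the values have converged.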

In my experience, policy iteration is faster than value iteration, because a policy converges more quickly than a value function. I remember this being described in the book as well.

I guess the confusion mainly comes from all these somewhat similar terms, which confused me too at first.