Implement Policy Iteration in Python – A Minimal Working Example
Learn about this classical Dynamic Programming algorithm to optimally solve Markov Decision Process models
A few days ago I wrote an article on value iteration (Richard Bellman, 1957), today it is time for policy iteration (Ronald Howard, 1960). Policy iteration is an exact algorithm to solve Markov Decision Process models, being guaranteed to find an optimal policy. Compared to value iteration, a benefit is having a clear stopping criterion – once the policy is stable, it is provably optimal. However, it often has a higher computational burden for problems with many states.
Policy iteration is at the root of many modern reinforcement learning algorithms, more specifically the class of policy approximation algorithms. In fact, Bertsekas & Tsitsiklis (1996) interpret it as an actor-critic model, combining policy updates with value functions. Before moving to your Proximal Policy Optimizations and Natural Policy Gradients, it make sense to grasp this fundamental algorithm first.
Policy iteration
Compared to value iteration, policy iteration is a bit more extensive. Sutton & Barto (2019) break it down into three steps.
Step 1 – initialization – sets (possibly arbitrarily) both value functions and the policy. The value functions give the perceived value of being in each state, the policy is a rule that returns an action for any given state.
In Step 2 – policy evaluation – the value for each state is determined in a way very similar to value iteration. However, instead of determining values by maximizing over all actions, we simply compute values using the current policy. In short, we multiply the value for action a=π(s) (direct reward r+ downstream value V(s')) with the transition probability p. For more details – such as the error toleranceθ— please check out the article on value iteration.
Implement Value Iteration in Python – A Minimal Working Example
Step 3 – policy improvement – seeks to improve the policy using the prevailing value functions. For every state, it verifies whether the action suggested by the current policy indeed is the value-maximizing action. If not, the policy is updated and Step 2 is repeated. The algorithm alternates between both steps until the policy remains stable. At that point, the optimal policy has been reached and the algorithm terminates.
For a direct comparison between value iteration and policy iteration, see them side-by-side here.
A minimal working example
Time for a coding example. I like to demonstrate algorithms with very simple problems, as evidenced by the brief outline below.
The problem
The problem of interest is a one-dimensional world (a row of tiles). In the middle, there is a tile that serves as terminating state. Landing on this tile yields a reward of +10, stepping between tiles costs -1 per move. The agent can decide to move left or right, but ends up going into the wrong direction 10% of the time. With a direct reward, an expected downstream reward, and transition probabilities, it has the basic elements of an MDP.
The algorithm
The Python algorithm is not too different from the mathematical procedure shown earlier. Note that the maximum numbers of iterations are simplifications to cap computational effort. In particular, if two or more policies perform equally well, the algorithm may endlessly cycle between them.
Some experiments
Time for a bit of experimentation. I will provide some computational examples as well as a bit of reflection on the results.
We set all value functions V(s)=0, and initialize the policy at π=[0,0,0,0,0] (i.e., always move left).
Then, we move towards policy evaluation. To reiterate: in this step, we determine state values under the current policy π. The policy evaluation is an iterative step, repeated until all values are under the error threshold θ. In this case (state 0, iteration 1), the initial estimate V(0) is 0. As the direct reward is -1, the error Δ is also 1 and we need to repeat.
It takes some time to get to the right values. If we take θ= 1e-10 , we need no less than 185 iterations before Δ<θ for all states. We obtain V=[-8.62, -7.08, 10.00, 7.61, 5.68].
Now, we move on to the policy improvement. We test whether our initial policy (‘always go left’) is indeed optimal – spoiler alert: it’s not – given the values we just determined.
Evidently, at the leftmost tile, going right is better than going left. Thus, the stopping criterion is not met. We update π and return to the policy evaluation step to recompute the values.
Ultimately, we should arrive at a point where the policy no longer changes. In this case it’s just one more cycle. Having updated the policy, we compute the new state values (16 iterations) during policy evaluation and make no policy changes in the subsequent policy improvement step. The algorithm has terminated, yielding π=[1,1,0,0,0].
Other than that, there’s not much to say actually. There’s a computational tradeoff between θ and the number of cycles between Step 2 and Step 3, but for this particular problem it hardly matters.
Although in many ways similar to value iteration, note that policy iteration returns a policy and value iteration a set of value functions. When moving towards approximate (reinforcement learning) algorithms, this distinction becomes relevant. You can have a good policy without having great value functions (or even knowing them) and you can have accurate value functions without an explicit understanding of the policy itself.
Final words
Like value iteration, policy iteration is a fundamental algorithm from which many approximate algorithms are derived. Make sure to understand policy iteration, before truly diving into the world of reinforcement learning.
Some more minimal working examples you might be interested in:
A Minimal Working Example for Deep Q-Learning in TensorFlow 2.0
A Minimal Working Example for Continuous Policy Gradients in TensorFlow 2.0
A Minimal Working Example for Discrete Policy Gradients in TensorFlow 2.0
References
Bellman, R. (1957). A Markovian decision process. Journal of mathematics and mechanics, 6(5), 679–684.
Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Athena Scientific.
Howard, R. A. (1960). Dynamic programming and Markov processes.
Sutton, R.S., and Barto, A.G. (2019). Reinforcement learning: An introduction. MIT Press.
Share This Article
Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.
Write for TDS