Implement Policy Iteration in Python – A Minimal Working Example

Learn about this classical Dynamic Programming algorithm to optimally solve Markov Decision Process models

Dec 12, 2021

6 min read

A few days ago I wrote an article on value iteration (Richard Bellman, 1957), today it is time for policy iteration (Ronald Howard, 1960). Policy iteration is an exact algorithm to solve Markov Decision Process models, being guaranteed to find an optimal policy. Compared to value iteration, a benefit is having a clear stopping criterion – once the policy is stable, it is provably optimal. However, it often has a higher computational burden for problems with many states.

Policy iteration is at the root of many modern reinforcement learning algorithms, more specifically the class of policy approximation algorithms. In fact, Bertsekas & Tsitsiklis (1996) interpret it as an actor-critic model, combining policy updates with value functions. Before moving to your Proximal Policy Optimizations and Natural Policy Gradients, it make sense to grasp this fundamental algorithm first.

The Four Policy Classes of Reinforcement Learning

Policy iteration

Compared to value iteration, policy iteration is a bit more extensive. Sutton & Barto (2019) break it down into three steps.

👁 Policy iteration algorithm [source: Sutton & Barto (publicly available), 2019]

Policy iteration algorithm [source: Sutton & Barto (publicly available), 2019]

Step 1 – initialization – sets (possibly arbitrarily) both value functions and the policy. The value functions give the perceived value of being in each state, the policy is a rule that returns an action for any given state.

In Step 2 – policy evaluation – the value for each state is determined in a way very similar to value iteration. However, instead of determining values by maximizing over all actions, we simply compute values using the current policy. In short, we multiply the value for action a=π(s) (direct reward r+ downstream value V(s')) with the transition probability p. For more details – such as the error toleranceθ— please check out the article on value iteration.

Implement Value Iteration in Python – A Minimal Working Example

Step 3 – policy improvement – seeks to improve the policy using the prevailing value functions. For every state, it verifies whether the action suggested by the current policy indeed is the value-maximizing action. If not, the policy is updated and Step 2 is repeated. The algorithm alternates between both steps until the policy remains stable. At that point, the optimal policy has been reached and the algorithm terminates.

For a direct comparison between value iteration and policy iteration, see them side-by-side here.

👁 Comparison between policy iteration (left) and value iteration (right). The key difference is that policy iteration segregates valuation and policy updates. [Adapted from Sutton & Barto (publicly available), 2019]

Comparison between policy iteration (left) and value iteration (right). The key difference is that policy iteration segregates valuation and policy updates. [Adapted from Sutton & Barto (publicly available), 2019]

A minimal working example

Time for a coding example. I like to demonstrate algorithms with very simple problems, as evidenced by the brief outline below.

The problem

The problem of interest is a one-dimensional world (a row of tiles). In the middle, there is a tile that serves as terminating state. Landing on this tile yields a reward of +10, stepping between tiles costs -1 per move. The agent can decide to move left or right, but ends up going into the wrong direction 10% of the time. With a direct reward, an expected downstream reward, and transition probabilities, it has the basic elements of an MDP.

The algorithm

The Python algorithm is not too different from the mathematical procedure shown earlier. Note that the maximum numbers of iterations are simplifications to cap computational effort. In particular, if two or more policies perform equally well, the algorithm may endlessly cycle between them.

Some experiments

Time for a bit of experimentation. I will provide some computational examples as well as a bit of reflection on the results.

We set all value functions V(s)=0, and initialize the policy at π=[0,0,0,0,0] (i.e., always move left).

Then, we move towards policy evaluation. To reiterate: in this step, we determine state values under the current policy π. The policy evaluation is an iterative step, repeated until all values are under the error threshold θ. In this case (state 0, iteration 1), the initial estimate V(0) is 0. As the direct reward is -1, the error Δ is also 1 and we need to repeat.

👁 First policy evaluation step. During policy evaluation, we compute values for each state under the prevailing policy. The evaluation step is repeated until values are within the pre-defined error bounds.

First policy evaluation step. During policy evaluation, we compute values for each state under the prevailing policy. The evaluation step is repeated until values are within the pre-defined error bounds.

It takes some time to get to the right values. If we take θ= 1e-10 , we need no less than 185 iterations before Δ<θ for all states. We obtain V=[-8.62, -7.08, 10.00, 7.61, 5.68].

Now, we move on to the policy improvement. We test whether our initial policy (‘always go left’) is indeed optimal – spoiler alert: it’s not – given the values we just determined.

👁 First policy improvement step for state 1. Given the state values determined earlier, we verify whether the current policy is optimal or a different action would yield better results. If we can improve the current policy by taking a different action, we update the policy and return to policy evaluation.

First policy improvement step for state 1. Given the state values determined earlier, we verify whether the current policy is optimal or a different action would yield better results. If we can improve the current policy by taking a different action, we update the policy and return to policy evaluation.

Evidently, at the leftmost tile, going right is better than going left. Thus, the stopping criterion is not met. We update π and return to the policy evaluation step to recompute the values.

Ultimately, we should arrive at a point where the policy no longer changes. In this case it’s just one more cycle. Having updated the policy, we compute the new state values (16 iterations) during policy evaluation and make no policy changes in the subsequent policy improvement step. The algorithm has terminated, yielding π=[1,1,0,0,0].

Other than that, there’s not much to say actually. There’s a computational tradeoff between θ and the number of cycles between Step 2 and Step 3, but for this particular problem it hardly matters.

Although in many ways similar to value iteration, note that policy iteration returns a policy and value iteration a set of value functions. When moving towards approximate (reinforcement learning) algorithms, this distinction becomes relevant. You can have a good policy without having great value functions (or even knowing them) and you can have accurate value functions without an explicit understanding of the policy itself.

Final words

Like value iteration, policy iteration is a fundamental algorithm from which many approximate algorithms are derived. Make sure to understand policy iteration, before truly diving into the world of reinforcement learning.

Some more minimal working examples you might be interested in:

A Minimal Working Example for Deep Q-Learning in TensorFlow 2.0

A Minimal Working Example for Continuous Policy Gradients in TensorFlow 2.0

A Minimal Working Example for Discrete Policy Gradients in TensorFlow 2.0

References

Bellman, R. (1957). A Markovian decision process. Journal of mathematics and mechanics, 6(5), 679–684.

Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Athena Scientific.

Howard, R. A. (1960). Dynamic programming and Markov processes.

Sutton, R.S., and Barto, A.G. (2019). Reinforcement learning: An introduction. MIT Press.

Written By

Wouter van Heeswijk, PhD

See all from Wouter van Heeswijk, PhD

Dynamic Programming, Machine Learning, Markov Decision Process, Policy Iteration, Reinforcement Learning

Share This Article

Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

URL: https://towardsdatascience.com/implement-policy-iteration-in-python-a-minimal-working-example-6bf6cc156ca9/