VOOZH about

URL: https://towardsdatascience.com/cliff-walking-problem-with-the-discrete-policy-gradient-algorithm-59d1900d80d8/

⇱ Cliff-Walking Problem With The Discrete Policy Gradient Algorithm | Towards Data Science


Cliff-Walking Problem With The Discrete Policy Gradient Algorithm

A full implementation of the REINFORCE algorithm in Python. The steps are performed manually to illustrate the inner workings of the…

8 min read
👁 Photo by Jennifer Capel on Unsplash
Photo by Jennifer Capel on Unsplash

The textbook cliff walking problem continues to fascinate. Although exceedingly simple, it elucidates many interesting aspects of reinforcement learning algorithms. After treating some value-based implementation (SARSA and Q-learning [here](https://towardsdatascience.com/deep-q-learning-for-the-cliff-walking-problem-b54835409046), Deep Q-learning here), now it is time to move to a policy-based implementation. Although many libraries exist nowadays, we manually implement the procedure in Python to demonstrate the mechanisms behind the discrete policy gradient algorithm. We close with a head-to-head comparison with SARSA.

Policy Gradients In Reinforcement Learning Explained

The cliff walking problem

Let’s sketch the problem first. Our world is a fairly comprehensive grid, except for a lethal cliff that should be avoided. The agent starts at the left-hand side of the cliff, the target (unknown to the agent) awaits at the right-hand side. The objective is to find the shortest path to the target without falling into the cliff, which directly ends the episode.

👁 Cliff world [image by author]
Cliff world [image by author]

Various reinforcement learning solutions exist. This article focuses on the policy gradient algorithm. Simply speaking, a policy tells the agent what to do in a given state – more direct than computing and utilizing value functions corresponding to each action. In this setting, you might be inclined to view the policy as a table stating the best action at each of the 48 tiles.

It’s a bit more complicated though, as the method relies on stochastic policies. Rather than simply telling us the best action, the policy provides a probability with which we should take each action. Without going into too much detail, stochastic policies often make sense in unknown environments and provide a built-in mechanism for exploration. Probabilities for bad actions should converge to near-zero.

Discrete policies— Softmax function

Time to formalize the policy π a bit more. We use a softmax policy, which effectively maps any series of real values to a series of probabilities. As input it requires a feature vector and a vector of corresponding weights.

As feature vector ϕ(s,a), we define a 48⋅4 vector, with each four elements representing the moves on any given tile – essentially a lookup table (alternatively a post-decision state approach might be used, requiring only 48 features but losing detail). The vector is one-hot encoded, meaning we have a vector with 191 zeros and 1 one corresponding to action a in state s. As said, θ is a vector of weights, one for each (s,a) pair. The dot product of ϕ(s,a) and θ captures the value of being on a specific tile, comparable to the well-known Q-value.

Now, we can formalize the softmax policy as follows:

👁 Softmax policy.
Softmax policy.

The larger ϕ(s,a)^⊤ θ, the higher the probability of taking action a. In normal language: with high probability we take steps that bring us closer to the target. But before we know which steps those are, we first need to learn the policy.

Discrete policy gradient

Having defined the policy itself, let’s figure out how to update it. This is where the policy gradient theorem comes in. To preserve the article scope, we restrict ourselves to the weight update procedure:

👁 Update function for policy gradient methods
Update function for policy gradient methods

Here, α is simply the learning rate. The reward function _Gt is also straightforward:

👁 Cumulative reward function
Cumulative reward function

The main challenge is in that nasty _∇_θ log(πθ(a|s)) – also known as the score function – how do we translate that to an implementable rule? Mathematically it is quite some work (detailed derivation here), so let’s skip straight to the end result:

👁 Score function for discrete policy gradient. The score function is the feature vector for the selected action minus the weighted feature vector over all actions.
Score function for discrete policy gradient. The score function is the feature vector for the selected action minus the weighted feature vector over all actions.

The bottom equation is typically presented as the end results in textbooks, but I actually find the other equations more useful.

Let’s give an example. Working with a 48-element vector is a bit inconvenient, so assume we have a feature vector with only four elements. First, our one-hot vector ϕ(s,a) might look like [0 1 0 0] – representing the action we take. Second, we compute our expectation, which essentially is a weighted feature vector. Suppose we have a probability vector [0.2 0.5 0.1 0.2], with each action leading us to a different tile (state). All actions are one-hot encoded, so the expectation can be written out by

weighted_phi: [1 0 0 0]*0.2 + [0 1 0 0]*0.5 + [0 0 1 0]*0.1 + [0 0 0 1]*0.2 = [0.2 0.5 0.1 0.2]

Yes, that’s exactly the same as the probability vector itself. It’s not always this trivial – the feature vector is not necessarily a one-hot vector – but you get the idea. Now, we only need to subtract the expectation from the feature vector to obtain

score function: [0 1 0 0] - [0.2 0.5 0.1 0.2] = [-0.2 0.5 -0.1 -0.2]

To recap, we took the feature vector corresponding to our chosen action and subtracted the weighted feature vector over all actions, obtaining the score of the action.

To perform the weight update Δθ, we multiply the score function with the reward _Gt and learning rate α. For instance:

Weight update: Δθ = α * score_function * G_t = 0.01 *[-0.2 0.5 -0.1 -0.2]*10 = [-0.02 0.05 -0.01 -0.02]

Intuitively, the implications are clear. If we multiply a positive score (action 2) by a positive reward, that probability increases and we will take the action more often in the future. More subtle: the stronger the reward signal, the larger the updates. Gradually, the probabilities will converge to a local optimum at worst, and the global optimum at best.

Python implementation

Back to our cliff walking problem. Initially, we have no clue what actions are good, so it makes sense to initialize probabilities equally (simply set all θ to 0). Our agent will stumble around for some time, but ultimately should hit the target. This gives a strong reward signal for the penultimate tile (increasing the probability of moving down), and from there it’s a chain reaction.

Note that, although the method is policy-based rather value-based, reward trajectories implicitly retain a central role in the entire procedure. Updates only take place at the end of the episode, so if we took 20 steps we perform 20 updates that trace back the full trajectory (i.e., Monte Carlo learning).

The full implementation can be found on my GitHub, here I focus on the most relevant snippets (cleaned up a bit for readability).

First, let’s define the policy:

The corresponding softmax may be expressed as follows. Practical implementations typically involve scaling – the exponent may get very large.

The reward function _Gt is computed for each decision moment t in the stored trajectory, using corresponding discounting. Typically, you’d also want to perform some reward normalization here.

Finally, the score function _∇_θ log(πθ(a|s)), required to perform the update _Δθ=α⋅ G_t⋅∇_θ log(πθ(a|s)). Remind that we simply subtract the weighed feature vector (all actions) from the actual feature vector (selected action).

That’s all we need. Now let’s run it, using SARSA (another on-policy method, but value-based) for comparison.

👁 Numbers of steps for 10,000 episodes, training with policy gradient and SARSA. SARSA converges quicker, to a better solution, and experiences less noise. [image by author]
Numbers of steps for 10,000 episodes, training with policy gradient and SARSA. SARSA converges quicker, to a better solution, and experiences less noise. [image by author]
👁 Trained paths for policy gradient and SARSA. Due to higher exploration, the gradient agent takes a longer detour. Neither takes the shortest path, both being on-policy methods with embedded exploration. [image by author]
Trained paths for policy gradient and SARSA. Due to higher exploration, the gradient agent takes a longer detour. Neither takes the shortest path, both being on-policy methods with embedded exploration. [image by author]

It is directly obvious that the policy gradient updates are much messier than the SARSA updates. This is partially inherent to the stochastic policy (our SARSA agent only explores 5% of the time, the gradient method explores a lot more) – even after 10,000 iterations the agent takes a vast detour around the cliff to counter the adverse effects of random moves.

Additionally, SARSA uses temporal differences and the policy gradient uses the full reward trajectory – the latter has much higher variance. Consequently, we also need a lower learning rate α to acquire stable results.

Although the policy gradient method doesn’t come out favorably here, it’s not meant as a dismissal. Like any algorithm, it’s a great choice for some problems, a poor one for others. The results only serve as an illustration of typical behavior of REINFORCE, which sadly involves high variability and convergence to local optima. As always, proper tuning, reward normalization, feature design etc. go a long way in improving performance.

But for now, let’s just rejoice that we braved the cliff once more.


The full code, including other RL solutions to the cliff walking problem, can be found on my GitHub:

GitHub – woutervanheeswijk/cliff_walking_public: Cliff walking reinforcement learning example, with…

Takeaways

  • To solve the cliff walking problem we can adopt a discrete policy, selecting actions using a softmax policy.
  • Updates rely on the score function, which is the feature vector ϕ(s,a) corresponding to the selected action minus a weighted feature vector ϕ(s,⋅) over all actions.
  • Gradually, the policy converges to a high probability for the best action per state (tile in this case).
  • Due to high variability of reward trajectories and high exploration, convergence is slower than for value-based methods in this basic implementation.

The implementation for Q-learning and SARSA:

Walking Off The Cliff With Off-Policy Reinforcement Learning

Implementation for Deep Q-learning:

Deep Q-Learning for the Cliff Walking Problem

Catching up on the foundations of discrete policy gradients first? Check out my articles with minimum working example and theoretical derivation:

A Minimal Working Example for Discrete Policy Gradients in TensorFlow 2.0

Policy Gradients In Reinforcement Learning Explained

References

Benderski, E. (2016). The softmax function and its derivative. https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/

Siegel, E. (2029). Policy gradient. https://siegel.work/blog/PolicyGradient/

Silver, D. (2020). Lecture 7: Policy gradient https://www.davidsilver.uk/wp-content/uploads/2020/03/pg.pdf

Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256.

Yu, E (2017). Policy gradients from scratch with Python. http://quant.am/cs/2017/08/07/policy-gradients/


Written By

Wouter van Heeswijk, PhD

Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

Related Articles