Temporal-Difference Learning: Combining Dynamic Programming and Monte Carlo Methods for Reinforcement Learning

Milestones of RL: Q-Learning and Double Q-Learning

Oct 17, 2024

17 min read

We continue our deep dive of Sutton’s book "Reinforcement Learning: An Introduction" [1], and in this post introduce Temporal-Difference (TD) Learning, which is Chapter 6 of said work.

TD learning can be viewed as a combination of Dynamic Programming (DP) and Monte Carlo (MC) methods, which we introduced in the previous two posts, and marks an important milestone in the field of Reinforcement Learning (RL) – combining the strength of aforementioned methods: TD learning does not need a model and learns from experience alone, similar to MC, but also "bootstraps" – uses previously established estimates – similar to DP.

👁 Photo by Brooke Campbell on Unsplash

Photo by Brooke Campbell on Unsplash

Here, we will introduce this family of methods, both from a theoretical standpoint but also showing relevant practical algorithms, such as Q-learning – accompanied with Python code. As usual, all code can be found on GitHub.

We begin with an introduction and motivation, and then start with the prediction problem – similar to the previous posts. Then, we dive deeper in the theory and discuss which solution TD learning finds. Following that, we move to the control problem, and present a sequence of fundamental algorithms: Sarsa, Q-learning, Expected Sarsa and Double Q-learning. We conclude the post with some practical tips.

Introduction to Temporal-Difference Learning

As already stated, TD methods can be viewed as a combination of DP and MC. Let’s quickly recap these: DP methods require a model, and iteratively aim to improve their estimate of the value function. For this, they turn the Bellman equation into an update rule, and compute:

👁 Image from [1]

Image from [1]

That is, they use the previously estimated value function as a starting point for an updated estimate – averaging over all actions and the corresponding returns plus the previous value estimate – they bootstrap.

Conversely, MC methods can learn from experience alone, which is a stunning realization. To do so, they play out full episodes, and then estimate a value function by "simply" averaging the observed returns:

👁 Image from [1]

Image from [1]

TD methods kind of combine the best of both worlds: they bootstrap, i.e. can be implemented fully incremental – but do not need a model of the environment and learn from experience alone, too. A simple update rule could be:

👁 Image from [1]

Image from [1]

For this, we could sample episodes following some policy, and after every observed return we would update our value estimate as a weighted sum of the previous estimate and the value we now believe to be more accurate: the observed reward plus the estimated value of the next state.

We will encounter the quantity on the right multiple times from here on, it is the TD error:

👁 Image from [1]

Image from [1]

TD Prediction

Let’s put this together to a first prediction algorithm, dubbed TD(0):

👁 Image from [1]

Image from [1]

This should formalize what we introduced in the previous paragraph: goal is estimating the value function for a policy π. To do so, we generate episodes following this policy, and update the value estimate with the formula introduced above – until we are satisfied with the accuracy of the estimate (e.g. until convergence).

Above method is called TD(0) – or one-step TD – because we do one step in the environment and immediately update our value estimate. Later in the book Sutton introduces n-step TD and TD(λ), which seamlessly unify DP and MC methods.

To conclude this section, let’s re-iterate why TD methods are advantageous. First, assuming access to a (complete) model of the environment, as is required by DP, is a strong assumption. Thus, learning from experience alone is great. However, waiting for full episodes to finish, as is done by MC – is another heavy burden on the algorithm. Just think of some long games, such as Go – playing out each game termination is (near) infeasible – which explains the advantage of bootstrapping.

Convergence of TD(0)

Let’s now analyze TD a bit closer – and especially answer the questions: does it converge? And if yes, to which solution? The first question we won’t answer in detail here, but instead just give the answer: yes – TD methods converge, and they converge to a satisfying solution. A solution we will now describe. Surprisingly, the found solution is different to what for example MC methods find.

For this, assume we have some fixed dataset of episodes we sample from – and we iteratively feed these in batches to our algorithm until convergence – we call this a "batched" setup (for non-batched TD(0), the solution is not exactly the same – but we optimize towards the same direction).

We will deduce this using the example given in [1]. Assume, you observe the following episodes from some (unknown) MDP:

👁 Image from [1]

Image from [1]

The sequences are written as state – reward pairs, i.e. the first episode starts in state A, transitioned to B with a reward of 0, and then terminates in B with a reward of 0. What would be the state estimate for A following a MC method? We have observed A only once, with a reward of 0 – ergo, the value estimate will be 0.

However, would you also come to this conclusion? Maybe not. You might have built the following diagram, the imagined MDP, in your mind:

👁 Image from [1]

Image from [1]

From the data we have seen, A always precedes B. And since B’s estimated value is 3/4 (6/8), ignoring discounting we should also estimate 3/4 for A. This is exactly what TD(0) does.

The reason for this is that MC finds that solution which minimizes the (squared) error on the train set – and predicting 0 for A is the best we can do given this metric and data. On the other hand, TD learning finds the maximum likelihood estimate for the parameters of the Markov process most likely generating this data – and that is above diagram. This might also explain why TD(0) (usually) converges faster, and can find better solutions.

TD Control

After reading about the basics of TD learning and some theoretical properties (of the prediction problem), let’s come to the control problem, i.e. actually finding a policy using these methods. In this section we will introduce four methods for this, among others the famous Q-learning algorithm.

Before we start, some general remarks though – which are shared between our discussion of MC methods, and were introduced more lengthily in the previous post:

For the following algorithms, instead of estimating value functions we instead switch to estimating values for state-action pairs, which define the action-value function.
For methods involving sampling it is crucial, that all state-action pairs keep on being visited. To do so, we can e.g. work with ε-greedy policies for on-policy methods, or use off-policy methods in which the behavior policy keeps exploring.

Sarsa

The first on-policy method we look at is Sarsa, a relatively famous and fundamental algorithm in the space. Similar to MC methods, we play full episodes – but now instead update the value estimate after every timestep (TD learning), instead of only after every episode.

As mentioned, we don’t work with the value function though, but with the action-value function. Thus above stated update rule translates to:

👁 Image from [1]

Image from [1]

The name of this algorithm stems from the quintuple which describes two consecutive timesteps:

👁 Image from [1]

Image from [1]

All of these values are used in the update step.

Let’s have a look at the pseudocode:

👁 Image from [1]

Image from [1]

It should read pretty natural after above introduction, and also translates to Python code in a straight-forward manner:

import random

from gym_utils import get_observation_action_space
import numpy as np

from env import ParametrizedEnv

ALPHA = 0.1
NUM_STEPS = 1000

def get_eps_greedy_action(q_values: np.ndarray, eps: float = 0.05) -> np.integer:
 if random.uniform(0, 1) < eps or np.all(q_values == q_values[0]):
 return np.random.choice([a for a in range(len(q_values))])
 else:
 return np.argmax(q_values)

def sarsa(env: ParametrizedEnv) -> np.ndarray:
 observation_space, action_space = get_observation_action_space(env)
 Q = np.zeros((observation_space.n, action_space.n))

 for _ in range(NUM_STEPS):
 observation, _ = env.env.reset()
 terminated = truncated = False
 action = get_eps_greedy_action(Q[observation])

 while not terminated and not truncated:
 observation_new, reward, terminated, truncated, _ = env.env.step(action)
 action_new = get_eps_greedy_action(Q[observation_new])
 q_next = Q[observation_new, action_new] if not terminated else 0
 Q[observation, action] = Q[observation, action] + ALPHA * (
 reward + env.gamma * q_next - Q[observation, action]
 )
 observation = observation_new
 action = action_new

 return np.array([np.argmax(Q[s]) for s in range(observation_space.n)])

There are just a few things we need to talk about (and some we will discuss in more details in the "Advanced" section towards the end):

Similar as in the post about MC methods, we initialize the Q table to zeros, and then randomly break ties in get_eps_greedy_action. Sutton do so randomly, but require Q to be set to 0 for terminal states. Further, in the update step we set the Q value of terminal states to 0.

And similar to the previous post, we can run this algorithm within the framework defined there:

python grid_world.py --method=sarsa

This will spawn a grid world environment and use Sarsa to find a way through it.

Q-Learning

Next we come to Q-learning, which truly can be considered a breakthrough in the field of RL. It is an off-policy algorithm, which simplified the analysis and convergence proofs – and also lead to groundbreaking results for applying RL algorithms to "real-world" problems with deep neural networks [2].

Let’s take a look at the update rule:

👁 Image from [1]

Image from [1]

When comparing this to Sarsa, we see that now instead of taking the actually executed transition, we use the maximum Q value of the successor state. That is: we use a different policy for updating our estimates (a greedy policy taking the max. action) than for generating episodes (which is our default ε-greedy policy) (explaining why Q-learning is considered off-policy, and answering Excercise 6.11 in Sutton’s book).

Regarding pseudo-code, there is no big different to Sarsa:

👁 Image from [1]

Image from [1]

Same for the Python implementation:

def q(env):
 observation_space, action_space = get_observation_action_space(env)
 Q = np.zeros((observation_space.n, action_space.n))

 for _ in range(NUM_STEPS):
 observation, _ = env.env.reset()
 terminated = truncated = False

 while not terminated and not truncated:
 action = get_eps_greedy_action(Q[observation])
 observation_new, reward, terminated, truncated, _ = env.env.step(action)
 Q[observation, action] = Q[observation, action] + ALPHA * (
 reward + env.gamma * np.max(Q[observation_new]) - Q[observation, action]
 )
 observation = observation_new

 return np.array([np.argmax(Q[s]) for s in range(env.env.observation_space.n)])

Before moving on, let’s take a moment to compare Sarsa and Q-learning – and do so following an interesting example Sutton presents. Goal is finding a path through the following grid world environment (quite similar to our toy problem):

👁 Image from [1]

Image from [1]

We have to go from (S)tart to (G)goal as fast as possible while not falling off the cliff. Q-learning uses a max-operation in the update – thus the learned Q function prefers the direct, optimal path. However, our action selection is ε-greedy – meaning if such a random action is chosen, we could fall off the cliff. In contrast, Sarsa considers this action selection implicitly due to using its own policy to update the value estimates. And indeed, the collected rewards by Sarsa are higher than those collected by Q-learning:

👁 Image from [1]

Image from [1]

It should be noted though, that in general Q-learning is much more widely used, and unlikely Sarsa was successfully deployed to hard real-world problems – the promises and advantages of off-policy learning just seem to be too good.

Next, we will consider two more, slightly advanced, algorithms, building upon the previously introduced ones.

Expected Sarsa

Expected Sarsa basically extends Sarsa with an expectation. Let’s recall Sarsa’s update rule:

👁 Image from [1]

Image from [1]

Now, we re-write the Q-value of step + 1 to account for the current policy, and form an expectation over that:

👁 Image from [1]

Image from [1]

That is, instead of simply taking states S_ {t+1} and action A_{t+1}, as they were observed in the episode, we vary A_{t+1} and weigh the resulting outcome by how likely it is under the current policy (a softmax instead of an argmax).

Sutton don’t list any pseudocode, so let’s directly go into the Python version:

def expected_sarsa(env: ParametrizedEnv) -> np.ndarray:
 observation_space, action_space = get_observation_action_space(env)
 Q = np.zeros((observation_space.n, action_space.n))

 def _get_action_prob(Q: np.ndarray) -> float:
 return (
 Q[observation_new, a] / sum(Q[observation_new, :])
 if sum(Q[observation_new, :])
 else 1
 )

 for _ in range(NUM_STEPS):
 observation, _ = env.env.reset()
 terminated = truncated = False
 action = get_eps_greedy_action(Q[observation])

 while not terminated and not truncated:
 observation_new, reward, terminated, truncated, _ = env.env.step(action)
 action_new = get_eps_greedy_action(Q[observation_new])
 updated_q_value = Q[observation, action] + ALPHA * (
 reward - Q[observation, action]
 )
 for a in range(action_space.n):
 updated_q_value += ALPHA * _get_action_prob(Q) * Q[observation_new, a]
 Q[observation, action] = updated_q_value
 observation = observation_new
 action = action_new

 return np.array([np.argmax(Q[s]) for s in range(observation_space.n)])

Expected Sarsa removes the variance in Sarsa stemming from the action selection in the second timestep, and in fact given the same experience usually performs better than Sarsa. But to be honest, I’m not sure how widely used and known (expected) Sarsa is, with Q-learning being much more popular and widespread – as mentioned above.

Double Q-Learning

But now we come to another fundamental algorithm / a fundamental concept in all of RL: Double Q-learning, which addresses the problem of maximization bias.

Many RL algorithms use some sort of maximization – e.g. in the previously introduced Q-learning for the update step we use a max-operation over Q values – and in particular, use a maximum of expected values as an estimate for the maximum value – which can lead to significant positive bias.

Sutton give some nice intuition and example why this is the case: imagine there are several actions, whose true values q(s, a) are all 0 – but our expectations Q(s, a) are taken from some distribution with mean 0, exhibiting positive and negative values. Thus, the estimated maximum will always be above 0. Let’s demonstrate this with a practical example. Consider the following MDP:

👁 Image from [1]

Image from [1]

We always start in state A, and can either go right with reward 0 to a terminal state, or go left to B. From B we can only continue left to another terminal state, but can do so by choosing one of several actions. Their corresponding reward is normally distributed with negative mean and variance 1 – i.e.: we will run into the maximization bias problem. And indeed, we can observe this by deploying Q-learning to this problem:

👁 Image from [1]

Image from [1]

The best action would always be go right, because on average, this gets the least negative reward. However, as we can see from the graph, as long as the value estimates have not converged well yet, Q-learning tends to go left due to the maximization bias.

The graph actually also shows a second algorithm, Double Q-learning – the one we will introduce in this section – and as we can see, this algorithm fares significantly better.

So let|s come to how to solve this problem – and in particular introduce aforementioned double-Q-learning algorithm. The issue comes from the fact that we use the same estimator for estimating our metric of interest (the value function), as well as choosing the maximum of this. If we could partition our dataset, learn two estimators – and obtain the value estimate from one, and use the other for the max-operation – the correlation would be reduced, and the problem mitigated. This is the concept of double learning, and exactly what double-Q-learning does.

The update rule looks as follows:

👁 Image from [1]

Image from [1]

Let’s compare this to standard Q-learning:

👁 Image from [1]

Image from [1]

As we can see, we introduce a second Q-value estimator, and use it’s estimated Q value as the update target, while the value-maximizing action is taken from the first Q-value estimator.

How do we train this? Well, in every step we simply flip a coin, and update Q_1 or Q_2 depending on the outcome. For deployment, we can then either use Q_1 or Q_2, or a combination of both.

Sutton list the following pseudocode:

👁 Image from [1]

Image from [1]

And in Python this looks as follows – there shouldn’t be any surprises:

def double_q(env):
 observation_space, action_space = get_observation_action_space(env)
 Q_1 = np.zeros((observation_space.n, action_space.n))
 Q_2 = np.zeros((observation_space.n, action_space.n))

 for _ in range(NUM_STEPS):
 observation, _ = env.env.reset()
 terminated = truncated = False

 while not terminated and not truncated:
 action = get_eps_greedy_action(Q_1[observation] + Q_2[observation])
 observation_new, reward, terminated, truncated, _ = env.env.step(action)
 if random.randint(0, 100) < 50:
 Q_1[observation, action] = Q_1[observation, action] + ALPHA * (
 reward
 + env.gamma * Q_2[observation_new, np.argmax(Q_1[observation_new])]
 - Q_1[observation, action]
 )
 else:
 Q_2[observation, action] = Q_2[observation, action] + ALPHA * (
 reward
 + env.gamma * Q_1[observation_new, np.argmax(Q_2[observation_new])]
 - Q_2[observation, action]
 )
 observation = observation_new

 return np.array([np.argmax(Q_1[s]) for s in range(env.env.observation_space.n)])

Discussion and Advanced Techniques

Let’s end this post with a discussion on general tips for implementing RL algorithms. In this post series, we mainly strive for introducing the basic algorithms, and not clutter them too much with "fancy" tricks optimizing their performance – that’s why they are omitted above, but discussed here.

The first one is [[ε](https://www.ascii-code.com/character/%CE%B5)](https://www.ascii-code.com/character/%CE%B5)-decay: in our experiments we have kept ε constant throughout the runtime of our algorithm – whereas in practice, one would usually employ a strategy of decaying ε over time. This way, initially we explore more, while later on we converge to the optimal policy (who wants to use a robot who tries out random actions with 5% probability?).

Next, I’d like to discuss the initialization of the Q estimates. Sutton suggests a random implementation, but setting Q(s, :) to 0 for terminal states s. The latter part seems like a pretty strong assumption, knowing about terminal states – and to be honest I’m not sure how this is done. Further, I often see an initialization to 0 or small values as recommend – just to not overestimate bad states. This was also confirmed by my experiments on the simple environment grid world, where random initialization would not give good results, or require many more steps. I thus opted for the zero initialization, as done already in the post about MC methods. Note this requires randomly breaking ties in the action selection – when initially all Q estimates are 0, the argmax would always return the same (the first) element – thus in this case we return a random action.

However, it should be noted that RL can very much feel like an empirical science with no right or wrong – there is usually huge variance in experiments, and one needs to find out what works well and what doesn’t.

Finally, in the first introduced algorithm, Sarsa, I mentioned setting Q values of terminal states to 0 in the update step (i.e. a similar intention as by Sutton, discussed above, but during a different time – during the update). This is also a recommended practice, which does not appear in the pseudocode. I introduced it when debugging Sarsa (it can be easy to make mistakes), hoping for performance improvements due to this. Eventually, the error was elsewhere, and for this simple environment this was not needed – thus I omitted it from the following algorithms.

Conclusion

In this post we introduced Temporal Difference (TD) learning, which can be viewed as a mix of Dynamic Programming (DP) and Monte Carlo methods (MC) – combining the strengths of both: TD methods do not require a model, but still are able to bootstrap without having to wait for full episodes to finish.

We started with a theoretical introduction and the prediction problem, showing against which solution TD learning converges (learning the underlying MDP and minimizing the error induced by this), as opposed to the solution found by MC methods (minimizing the empirical error on the given datasets of rollouts).

Then, we dove into the control problem and introduced four algorithms: Sarsa, Q-learning, expected Sarsa and double Q-learning. Q-learning might be one of the most influential algorithms in the history of RL, and double Q-learning builds on top of this to avoid the common maximization bias.

For all methods we showed pseudocode as well as complete Python code, all of which one can find on GitHub.

Thanks for reading! And if you liked this post, I would greatly appreciate some claps and a follow!

Other Posts in this Series

References

[1] http://incompleteideas.net/book/RLbook2020.pdf

[2] https://arxiv.org/abs/1312.5602

Written By

Oliver S

See all from Oliver S

Machine Learning, Python, Q Learning, Reinforcement Learning, Thoughts And Theory

Share This Article

Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

URL: https://towardsdatascience.com/temporal-difference-learning-combining-dynamic-programming-and-monte-carlo-methods-for-e0c2f0829a51/