About Post-Decision States Again

On the (not so) subtle differences between state-action pairs and post-decision states in Reinforcement Learning.

Jun 23, 2021

6 min read

Thoughts and Theory

My previous article on post-decision states didn’t receive a lot of traction, so I decided to write another one (something with foxes and snares, I suppose).

In that previous article, I made a case for the similarities between state-action pairs and post-decision states in Reinforcement Learning. Very briefly, the post-decision state is the concatenation of state and action (e.g., a Tic-Tac-Toe board after adding a mark, but before your opponent does), a sort of limbo state before the world starts moving again.

While focusing on the similarities, however, I also buried some important differences between both concepts. Leaving them unmentioned felt unfinished. This article focuses on the distinctions that set post-decision states and state-action pairs apart, giving a more complete view on the matter.

What are post-decision states?

Of course I can’t jump right back in where I left off, so let’s summarize the concept of post-decision states (for a detailed description see Powell [1]). As said in the intro, it is the problem state after taking an action, but before new stochastic information arrives (which would lead us to the next pre-decision state). Typically, you may reach a subset of post-decision states S'⊆S __ , based on the current state and feasible actions. The concept is best understood by decomposing the transition functions that lead from one state to another.

Typically, we have a transition function f that takes the current state s_t, chosen action a_t and exogenous information ω_t to lead us to the next state s_t+1:

👁 Transition from one state to the next

Transition from one state to the next

This transition function contains a deterministic and a stochastic component, which we can explicitly distinguish. First, we might deterministically progress from the pre-decision state s_t to the post-decision state s_t^a:

👁 Deterministic transition from state-action pair to post-decision state

Deterministic transition from state-action pair to post-decision state

Second, we can sample ω_t from the stochastic process and move from the post-decision state to the next pre-decision state s_t+1:

👁 Stochastic transition from post-decision state to next pre-decision state

Stochastic transition from post-decision state to next pre-decision state

Instead of computing Q-values Q(s_t,a_t), we could also compute them for the post-decision state (Q(s_t^a)), offering a more explicit perspective. Furthermore, we typically design features based on post-decision states: inventory levels after reordering, grid positions after making a move, etc. Simply said: the post-decision state incorporates the most recent information we have.

Tic-Tac-Toe

In Tic-Tac-Toe – the example used in the previous article – state-action pairs and post-decision states are virtually identical. Only the latest board configuration matters, not the action taken to get there. The post-decision state is quite literally a concatenation of the state-action pair:

👁 View at available actions in a given Tic-Tac-Toe state. The green player can take three actions or (equivalently) reach three post-decision states. [image by author]

View at available actions in a given Tic-Tac-Toe state. The green player can take three actions or (equivalently) reach three post-decision states. [image by author]

For all practical purposes, state-action pairs and post-decision states are equivalent here. It does not matter which action you took last, only the current state of the board is relevant. In the current state, you either take 1 out of 3 allowed actions (i.e., 3 state-action pairs), or you evaluate the 3 post-decision states that may be reached.

The next two examples illustrate that the concepts are not always this similar.

Cliff walking

To show a case in which state-action pairs are different, consider the cliff walking problem. In SARSA or Q-learning, you would fill a lookup table with Q(s,a) values, with s indicating the position on the grid and a the feasible moves (left, right, down, up). In contrast, the post-decision states would simply be grid positions.

👁 Comparison between Q-values for state-action pairs and post-decision states. For every four state-action pairs, we only need to store the value of one post-decision state. [image by author]

Comparison between Q-values for state-action pairs and post-decision states. For every four state-action pairs, we only need to store the value of one post-decision state. [image by author]

Note that we have 48*4=192 Q-values when working with state-action pairs, yet only 48 post-decision values. What exactly is the difference?

Let’s take a closer look at the highlighted tile. It can be reached from four adjacent tiles, each with their own state-action pairs. However, for the remaining trajectory, it actually doesn’t matter how you land on a given tile – all that matters is the remaining number of steps to the goal.

The post-decision state neatly captures this phenomenon. Regardless of the preceding tile, we only need a single post-decision value to predict the remaining number of steps. Having four distinct Q(s,a) values pointing to a single tile is actually a quite inefficient solution. Observe that – to achieve the same accuracy – we need four times more observations than for the post-decision implementation.

With only four actions, we can probably take the hit, but what if we have 1,000 actions? For most RL problems, reducing the computational effort by a factor 1,000 would be quite the benefit. Post-decision states are the preferred modelling choice here.

Multi-armed bandit

The multi-armed bandit is a bit of an oddball in RL textbook examples, as it is essentially a stateless problem. We have n slot machines, pull a lever, and we’re back at the exact same set of slot machines. If there is no problem state, how do we compute a post-decision state?

👁 Photo by Carl Raw on Unsplash

Photo by Carl Raw on Unsplash

I actually don’t want to get into that philosophical question right now. What is more interesting, is that there are no direct rewards. Only after pulling the lever we observe the stochastic reward, meaning it is embedded into the exogenous variable ω. How can we incorporate that into a post-decision state?

Technically we can’t, I’m afraid. If the post-decision state must be in a subset of the state space S'⊆S and S does not exist – or, at best, it holds a single, constant state – we cannot derive any (meaningful) post-decision state. In contrast, the state-action pair is found in(s,a) ∈ S × A, which is something we can work with.

In this case, state-action pairs are superior over post-decision states. Even if we have actual states to work with, we must memorize the action taken to distinguish between rewards; we cannot simply heap all observations onto a single post-decision state.

As you see, it’s not always trivial to pick the best modeling approach!

Takeaways

In many cases post-decision states and state-action pairs can be easily interchanged, but the modeling implications might be substantial.
If it does not matter how you arrive at a given state, use post-decision states.
If rewards are stochastic and unknown at the time of selecting an action, use state-action pairs.

Interested in more background? Check out my earlier article on post-decision states:

What Are Post-Decision States and What Do They Want From Us?

References

[1] Powell, W. B. (2007). Approximate Dynamic Programming: Solving the curses of dimensionality. John Wiley & Sons.

Written By

Wouter van Heeswijk, PhD

See all from Wouter van Heeswijk, PhD

Dynamic Programming, Machine Learning, Reinforcement Learning, State, Thoughts And Theory

Share This Article

Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

URL: https://towardsdatascience.com/about-post-decision-states-again-5725e5c15d90/