A Deep Dive into Problem States
Define appropriate problem states for your Markov Decision Process. Learn about higher-order MDPs and belief state variables to boost your…
If you have ever tried to build a Markov Decision Process (MDP) model, the first step was probably to define the state variable s. After all, the state is used to describe the system, determine feasible actions, compute rewards, and govern time transitions. Without a state, you can’t do very much.
Given the pivotal role of states in MDPs, it is surprising that textbook definitions are often rather implicit. This article will dive into the properties of a well-defined problem state, enabling you to define purposeful states for your models and algorithms.
Introduction to MDP states
Before setting out: this article leans quite heavily on the viewpoints posed by Warren Powell (professor emeritus at Princeton University). For a more detailed assessment of the topics treated here, I recommend having a look at the academic works listed at the end of this piece.
Let’s consider some common definitions (insofar as an explicit definition is offered):
Wikipedia (n.d.) – "A state variable is one of the set of variables that are used to describe the mathematical ‘state’ of a dynamical system".
Bellman (1957) – "… we have a physical system characterized at any stage by a small set of parameters, the state variables."
Puterman (2014) – "At each decision epoch, the system occupies a state."
Sutton & Barto (2018) – "…signal to represent the basis on which the choices are made (the states)"
Bertsekas (2019) – "…is the state of the system, an element of some space. […] Many classical problems in control theory involve a state that belongs to a Euclidean space, i.e., the space of n-dimensional vectors of real variables, where n is some positive integer."
On an abstract level, it is clear that the state offers a numerical representation of the system at a given point in time. Furthermore, we may infer that the state is linked to the decision-making process. However, many seminal works, valuable as they are, do not provide a particularly thorough definition of the concept of ‘state’.
Powell defines the state as follows:
Powell (2022) – The state variable contains everything we know, and only what we need to know, to make a decision and model our problem. State variables include physical state variables R_t (the location of a drone, inventories, investments in stocks), other information I_t about parameters and quantities we know perfectly (such as current prices and weather), and beliefs B_t, in the form of probability distributions, describing parameters and quantities that we do not know perfectly (this could be an estimate of how much a drug will lower the blood sugar in a new patient, or how the market will respond to price).
He argues that a problem state serves three purposes:
- Determining actions. As such, the state should contain all information needed for intelligent decision-making.
- Computing the transition function. Combined with the selected action and exogenous environment information, the state definition should suffice to compute the next state.
- Computing the reward function. Together with the selected action, state information must be enough to compute the direct reward corresponding to the state-action pair.
A properly defined state should contain exactly the information needed for the aforementioned purposes – no more, no less. Less information, and the state is insufficient for its objective. More information, and you simply track redundant information.
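The three purposes can be made concrete with a minimal sketch. The toy asset-selling problem below is an assumption made for illustration (it is not taken from Powell's work), and the names `State`, `feasible_actions`, `reward` and `transition` are hypothetical. The point is only that each function needs nothing beyond the state, the action and, for the transition, the exogenous information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    inventory: int  # units still in stock
    price: float    # current market price per unit


def feasible_actions(state: State) -> range:
    """Purpose 1: the state determines how many units we may sell."""
    return range(state.inventory + 1)


def reward(state: State, action: int) -> float:
    """Purpose 3: the direct reward follows from state and action alone."""
    return state.price * action


def transition(state: State, action: int, new_price: float) -> State:
    """Purpose 2: state, action and exogenous information give the next state."""
    return State(inventory=state.inventory - action, price=new_price)


s0 = State(inventory=3, price=10.0)
actions = list(feasible_actions(s0))
r = reward(s0, 2)
s1 = transition(s0, 2, new_price=12.0)
```

If the price were left out of the state, the reward could no longer be computed; if we added, say, the warehouse's street address, none of the three functions would use it. That is the "no more, no less" criterion in practice.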
Classifying information
Having established the purpose of a problem state, the next question is what information to include.
Many modelers restrict themselves to including solely physical properties of the system: the amount of cash held, the current position of trucks, the inventory in store… As we will see shortly, such definitions might be overly restrictive, ignoring elements that are crucial to the purposes of the problem state.
Powell argues that state variables can be categorized into three classes: (i) beliefs, (ii) information and (iii) physical properties or resources. More precisely, beliefs are the all-encompassing class, with the other two classes being constrained subclasses. For the sake of argument, they can be treated as three separate classes.
- Physical: Directly observable properties of the system, e.g., resources. The term ‘physical’ can be interpreted somewhat loosely here.
- Information: Non-tangible deterministic information. Can be directly observed, but is not necessarily a physical component of the system.
- Belief: Non-tangible probabilistic knowledge. Concretely, the belief may be represented by the parameters of a distribution.
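As a rough sketch of how the three classes can sit side by side in one state object, consider the hypothetical inventory state below. The field names and the choice of a normal distribution for the belief are assumptions made for illustration only.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class State:
    # Physical: directly observable resource
    inventory: int
    # Information: deterministic and observable, but not a physical resource
    price: float
    # Belief: parameters of an (assumed normal) distribution over unknown demand
    demand_mean: float
    demand_std: float

    def as_vector(self) -> tuple:
        """Flatten the state into a plain numerical representation."""
        return astuple(self)


s = State(inventory=40, price=9.5, demand_mean=25.0, demand_std=4.0)
vector = s.as_vector()
```

Note that the belief is stored as distribution parameters rather than as a single point estimate, so the uncertainty itself is available to the decision-maker.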
Some examples will follow shortly, but first we need to discuss one more property – higher-order MDPs.
Higher-order MDPs
By definition, any MDP fulfills the Markov property, aka the memoryless property. This property states that the future evolution of the system depends only on the present state (and the action taken), not on the states that preceded it. Consequently, decisions can be based on the present state alone. If the problem can be formulated in such a way, we can break down extremely complicated decision problems into a sequence of more manageable subproblems, to be solved one at a time.
It is natural to interpret the Markov property as ‘not utilizing any information from the past’. However, there is a distinction between not using states from the past and not using information from the past. In fact, Powell argues that past information can perfectly well be included in the present state.
From a decision-making perspective, this makes sense. Suppose sales have been steadily growing by 5% every year. Having only a single data point (most recent sales) would not reveal this trend, making anticipatory action impossible. By contrast, incorporating past sales figures in the state would reveal a clear upward trend.
Note we don’t need to refer back to past states – which would indeed violate the Markov property – but simply include past information in the present state. We encapsulate the memory within the state.
In mathematical terms, states including historical information are utilized in higher-order MDPs. Such models provide a richer representation of the system than first-order MDPs allow (which indeed only include present information). Taken to the extreme, a state could even include all historical information.
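The sales example can be sketched in a few lines. The fixed-length history window and the helper names `transition` and `estimated_growth` are hypothetical choices for illustration; the idea is only that the history lives inside the present state, so a decision never has to look up an earlier state.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class State:
    sales_history: Tuple[float, ...]  # oldest first, most recent last


def transition(state: State, new_sales: float, window: int = 3) -> State:
    """Append the newest observation and keep only the last `window` values."""
    history = (state.sales_history + (new_sales,))[-window:]
    return State(sales_history=history)


def estimated_growth(state: State) -> float:
    """Average period-over-period growth, computed from the present state only."""
    h = state.sales_history
    if len(h) < 2:
        return 0.0  # a single data point reveals no trend
    ratios = [later / earlier for earlier, later in zip(h, h[1:])]
    return sum(ratios) / len(ratios) - 1.0


state = State(sales_history=(100.0,))
growth_before = estimated_growth(state)
for sales in (105.0, 110.25):
    state = transition(state, sales)
growth_after = estimated_growth(state)  # roughly 0.05, i.e. 5% growth
```

With a window of one, this collapses to the first-order case in which the trend is invisible; a larger window trades a bigger state for more information.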
TL;DR: you can include past information in your problem states, and it often helps to make better decisions!
Examples
Now that we have all necessary ingredients to define our states, let’s provide some brief examples!
Example 1 – Energy management
- Physical: Current battery level
- Information: Electricity price, wind speed
- Belief: Forecasted energy demand distribution, remaining battery life
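To illustrate how a belief variable can evolve alongside the rest of the state, here is a minimal sketch for this example. It assumes, purely for illustration, that the belief about demand is a normal distribution and that each demand observation carries normally distributed noise with known variance, so the standard conjugate update applies. The class `EnergyState` and the function `update_belief` are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EnergyState:
    battery_level: float  # physical (kWh)
    price: float          # information
    wind_speed: float     # information
    demand_mean: float    # belief: mean of the demand estimate
    demand_var: float     # belief: variance of the demand estimate


def update_belief(state: EnergyState, observed_demand: float,
                  noise_var: float) -> EnergyState:
    """Normal-normal conjugate update (assumed model, known noise variance)."""
    precision = 1.0 / state.demand_var + 1.0 / noise_var
    mean = (state.demand_mean / state.demand_var
            + observed_demand / noise_var) / precision
    return replace(state, demand_mean=mean, demand_var=1.0 / precision)


s0 = EnergyState(battery_level=20.0, price=0.30, wind_speed=7.5,
                 demand_mean=50.0, demand_var=100.0)
s1 = update_belief(s0, observed_demand=60.0, noise_var=100.0)
```

With equal prior and noise variances, the updated mean lands halfway between the prior mean and the observation, and the variance halves. The physical and information components are untouched by this step; in a full model they would be updated by the same transition function.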
Example 2 – Warehouse management
- Physical: Physical product inventory in storage
- Information: Reserved (non-shipped) products, ordered replenishments
- Belief: Sales forecast scenarios per product
Example 3 – Portfolio management
- Physical: Amount of cash held, amount of each stock held
- Information: Past and current stock prices
- Belief: Probabilistic models of future price movements
Although categorizing variables as physical, information and belief variables is not strictly necessary, it helps to identify the information that is useful and needed to make the right decisions, providing a broader perspective on the matter.
Closing words
States are a key element of MDPs. Unfortunately, many modelers take an overly restrictive view of the concept, thereby omitting valuable historical data, non-tangible information, or probabilistic information.
The following takeaways may be helpful in designing richer states that support better decision-making:
- A state need not contain only physically observable properties of the system; it can also hold intangible information.
- Past information can be incorporated in states, as long as decisions are made based on the present state. In other words, states can exhibit memory.
- Even without perfect information (e.g., with only probabilistic knowledge), we may incorporate beliefs in our states to aid action selection.
References
Bellman, R. (1957). A Markovian decision process. Journal of mathematics and mechanics, 679–684.
Bertsekas, D. (2019). Reinforcement learning and optimal control. Athena Scientific.
Salnikov, V., Schaub, M. T., & Lambiotte, R. (2016). Using higher-order Markov models to reveal flow-based communities in networks. Scientific reports, 6(1), 1–13.
Powell, W. B. (2014). Clearing the jungle of stochastic optimization. In Bridging data and decisions (pp. 109–137). Informs.
Powell, W. B., & Meisel, S. (2015). Tutorial on stochastic optimization in energy – Part II: An energy storage illustration. IEEE Transactions on Power Systems, 31(2), 1468–1475.
Powell, W. B. (2022). Reinforcement Learning and Stochastic Optimization: A unified framework for sequential decisions. John Wiley & Sons.
Powell, W. B. (n.d.). Modeling. https://castlelab.princeton.edu/modeling/
Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
Wikipedia (n.d.) State variable. https://en.wikipedia.org/wiki/State_variable