Why Reinforcement Learning Doesn’t Need Bellman’s Equation

A re-assessment of the famous Bellman equation in Reinforcement Learning and MDP formulations

Feb 20, 2022

11 min read

👁 Richard Bellman founded dynamic programming and his famous recursive equation while working at RAND Corporation, based in Santa Monica. Photo by Sung Shin on Unsplash

Richard Bellman founded dynamic programming and his famous recursive equation while working at RAND Corporation, based in Santa Monica. Photo by Sung Shin on Unsplash

In academic circles, it’s often boilerplate to pair a Reinforcement Learning (RL) algorithm with a Markov Decision Process (MDP) formulation and the famed Bellman equation. At first sight, this makes sense, as we often aim to find approximately optimal policies for MDPs. However, in many ways RL has drifted so far from the origins of dynamic programming, that one may wonder whether we actually need Bellman equations in our problem definitions.

This article is not intended as a critique on the Bellman equation itself. As we will shortly see, it is a highly elegant formulation that captures even the messiest problems in a single line of mathematics. This is quite a feat, considering that many real-world combinatorial problems have more problem states than there are atoms in the universe.

When you grasp the complexity and magnitude of sequential decision making problems, you probably will appreciate Bellman’s contributions as much as I do. However, that does not mean we need to use them in RL.

What is dynamic programming?

👁 Photo by Julian Myles on Unsplash

Photo by Julian Myles on Unsplash

Dynamic programming has its origins in the 1950s. At the time, Richard Bellman worked for the RAND Corporation in Santa Monica. And there, sitting under a palmtree – or so I imagine – Richard Bellman invented his famous equation.

Although I will not go very deep into explaining dynamic programming here, the key notion is that large problems spanning a time horizon can be decomposed into a nested sequence of smaller problems. These smaller problems can be solved recursively, yielding an optimal decision-making policy providing the best action for every problem state.

To deploy dynamic programming, we must be able to define the problem as an MDP, including formalizations of (i) state space, (ii) action space, (iii) transition function and (iv) reward function (and possibly (v) discount rate). It is imperative that decisions only depend on the present state – the so-called Markov (or memoryless) property. If the conditions are met, we can (in theory) solve the problem using dynamic programming.

Bellman recognized that for each state, a value function could be defined, capturing both the direct reward/cost of an action and the (discounted) downstream value. For instance, we might decide how many products to stock up on each day, anticipating probabilistic sales scenarios.

Consider a three-state problem. In each state, we can take an action a∈A. Then, with probability p_s,s',a, we move to another state with associated value function V(s'). We can define the value functions for each state as a system of equations:

👁 A system of value functions for a three-state problem [image by author]

A system of value functions for a three-state problem [image by author]

Maximizing each value function optimizes the policy. In short, if we know the optimal value functions, we can always select the action that maximizes the expected cumulative reward. This is Bellman’s solution in a nutshell.

The beauty of the value function approach is that we don’t need to know anything about the actual problem structure or decision rules. Knowing the optimal value functions is equivalent to knowing the optimal policy.

The classical dynamic programming solution methods to find optimal value functions are value iteration and policy iteration. You can read about them in detail in the following articles:

Implement Value Iteration in Python – A Minimal Working Example

Implement Policy Iteration in Python – A Minimal Working Example

Bellman’s equation and the dimensionality curses

The three-state problem just discussed is very manageable, but with a million states the system of equations would be a mess. Fortunately, we don’t need to write out all equations by hand. In fact, the entire system can be collapsed into a single line of supreme mathematical elegance:

👁 Bellman's equation. Due to the mutual references between value functions the system is recursive. The optimal value functions can be found using dynamic programming techniques such as value iteration.

Bellman’s equation. Due to the mutual references between value functions the system is recursive. The optimal value functions can be found using dynamic programming techniques such as value iteration.

It’s dazzling, if you think about it. A state s can be a massive vector with elements, potentially even including stochastic knowledge. Action a might be subject to many constraints and only be identified by solving a complicated mathematical program. The deceptively simple probability function P might hide a grotesque stochastic system, leading to many possible outcome states s' that all require explicit evaluation. Sequential optimization in stochastic environments is often very, very messy. And yet, Bellman managed to capture all of those dynamics into a single, comprehensible line. It’s a work of art, really.

Unfortunately, the ability to model a problem does not mean we can actually solve it. Remember that three-state system of equations? In reality, the problem won’t contain just three states, but many, many more. Suppose we have a million states with a million possible actions each and a million states (outcomes) we can reach afterwards. That’s 10¹⁸ possibilities to consider, roughly the number of grains of sand on earth. Those familiar with combinatorial optimization know how easy it is to reach problems of such magnitudes, and far beyond.

Richard Bellman himself was perfectly aware of the computational limitations of his solution method, coining the term ‘curse of dimensionality’ to underline the lack of scalability for his approach. In fact, it might be three curses: state space, action space, and outcome space.

For many problems, we simply cannot resolve MDPs to optimality, no matter how much computing power we have available. This is where Reinforcement Learning comes in.

Value Function Approximation

If we wish to solve the Bellman equation, we must find a value function for every state s∈S. For every individual value function, we should evaluate the value function associated to each potential outcome s'∈S'(the outcome space S' might well equate the full state space S). For large state spaces, this simply doesn’t fly.

The typical Reinforcement Learning approach is to (i) repeatedly sample random state transitions (rather than exhaustively evaluating each possible outcome) and (ii) to replace the full state description by a set of features (avoiding the need to define value functions for every individual state). As such, we aim to find a policy which hopefully comes close to the optimal one. However, there are various ways to accomplish this.

Staying close the the dynamic programming approach, we might resort to value function approximation. With this approach, we explicitly try to model V(s) via an approximative function. Examples are Q-tables (SARSA, Q-learning), linear function approximations based on features, or critic networks. Note that the Bellman equation – in approximate form, of course -is used to express the decision-making policy:

👁 Policy definition for value function approximation. For this solution class, there is a visible link between the policy formulation and the Bellman equation, although the recursive aspect is gone.

Policy definition for value function approximation. For this solution class, there is a visible link between the policy formulation and the Bellman equation, although the recursive aspect is gone.

👁 Example of a linear value function approximation. Here, we replace the 'true' value function with a linear combination of features θ_f(s,a) and corresponding weights ϕ_f.

Example of a linear value function approximation. Here, we replace the ‘true’ value function with a linear combination of features θ_f(s,a) and corresponding weights ϕ_f.

You might say we approximate the Bellman equations, and that analogy is not too far-fetched.

Nonetheless, this analogy does not hold for the entire field of Reinforcement Learning, nor does it for human-decision making as a whole. If you approach a crossroad, you likely think in terms of ‘left’, ‘right’, ‘straight’ rather than pondering downstream values for the paths beyond. Many managers are perfectly able to make sequential decisions on a daily basis, without ever having heard of Bellman equations. Countless RL algorithms make no attempt to explicitly model value functions.

In short, value functions are not a necessity to formulate decision-making policies.

Objective functions

We are slowly arriving at the crux of this story (yes, finally). If there are so many solutions that do not require defining value functions, why do we keep including them in our problem formulations?

It appears a common misconception is at the root of this convention. In short: Bellman’s equation is not an objective function. It is an optimality condition. If we find the optimal value functions we find the optimal policy. However, the definition of an MDP does not require value functions in any way.

The sources of the confusion are easy to see: the maximization function hinting at an objective, the dynamic programming foundations, the optimal policy guarantee for optimal value functions. However, I repeat: Bellman’s equation is just an optimality condition. It can be solved using dynamic programming to find the optimal policy for an MDP. No more, no less.

So if the objective is not the Bellman equation, then what is it?

The answer is in the reward function (which is a mandatory component of any MDP). Whereas value functions are a somewhat abstract and artificial concept, reward functions define a very tangible reward (or cost) for every action we take. As we optimize over a (potentially infinite) time horizon, it makes sense to base our objective on cumulative rewards.

For finite decision horizons, we might simply take the sum of rewards as our objective function (e.g., maximize sales next week). For infinite horizon problems, we might take the discounted reward sequence or simply the average reward (Richard Sutton suggests the latter). Note these are very natural objectives, in line with how humans typically make decisions.

👁 Example of a finite horizon objective function. In this case, we simply maximize total rewards over time. Note we have no need for value functions.

Example of a finite horizon objective function. In this case, we simply maximize total rewards over time. Note we have no need for value functions.

Other Reinforcement Learning classes

With the Bellman equation out of the way, common RL methods such as policy gradient algorithms make more sense. In fact, out of the four policy classes identified by Warren Powell, only one (Value Function Approximation) closely ties to the Bellman equation. In alternative policy classes such as outlined below, we don’t see a trace of dynamic programming left:

👁 Policy Function Approximation (PFA) provides a direct expression to return decisions based on states.

Policy Function Approximation (PFA) provides a direct expression to return decisions based on states.

👁 Cost Function Approximation (CFA) augments reward functions to incorporate downstream effects, without overtly defining value functions for this purpose.

Cost Function Approximation (CFA) augments reward functions to incorporate downstream effects, without overtly defining value functions for this purpose.

You can see that Reinforcement Learning certainly does not entail solving the Bellman equation, not even approximately.

Why Discount Future Rewards In Reinforcement Learning?

The Four Policy Classes of Reinforcement Learning

Closing words

Time to complete the circle. Reinforcement Learning is a framework to resolve computationally challenging sequential decision problems. As they can grow rather complex, we need precise and unambiguous problem definitions to communicate with stakeholders and to find solutions. For this purpose, using the MDP notational conventions makes perfect sense.

Although a fair share of RL algorithms tries to approximate value functions as postulated in Bellman’s equation, many solutions do not follow the dynamic programming paradigm at all. As the Bellman equation is an optimality condition rather than an objective function, including it is redundant at best and misleading at worst. If anything, an RL decision-making policy might be an approximation of the Bellman equation, assuming we follow a value-based approach.

An MDP just needs four ingredients: (i) state space, (ii) action space, (iii) transition function and (iv) reward function. The Bellman equation or value functions are nowhere to be found. RL needs an objective function to work with, but this is simply a summed or averaged sequence of rewards. If we are not approximating value functions, Bellman’s equation has simply has no role.

Dynamic programming and the corresponding system of recursive value functions is a piece of mathematical brilliance. Bellman’s work was a breakthrough in sequential decision-making, which retains a large theoretical- and practical impact even today.

As with anything, there is just a time and place to use it.

Takeaways

Bellman’s equation is an optimality condition to solve MDPs, not the objective function. The underlying notion is that having the optimal value functions equates having the optimal policy. A system of value functions can be solved using dynamic programming.
Typical objective functions simply maximize cumulative rewards, e.g., taking the discounted reward stream or the average over time. These only require reward functions, not value functions.
Many Reinforcement Learning solutions have no direct ties to dynamic programming. Bellman’s equation is only explicitly used for one out of four policy classes, namely Value Function Approximation.
Value functions as propagated by Bellman do not necessarily reflect how humans make decisions. Maximizing (discounted) cumulative- or average rewards is a more natural way to communicate decisions.

URL: https://towardsdatascience.com/why-reinforcement-learning-doesnt-need-bellman-s-equation-c9c2e51a0b7/