Using Linear Programming to Boost Your Reinforcement Learning Algorithms

Large and high-dimensional action spaces are often computational bottlenecks in Reinforcement Learning. Formulating your decision problem…

Wouter van Heeswijk, PhD

Jan 17, 2021

10 min read

Large and high-dimensional action spaces are often computational bottlenecks in Reinforcement Learning. Formulating your decision problem as a linear program could vastly enhance the range of problems your algorithm can handle.

👁 Photo by Mehmet Turgut Kirkgoz via Pexels

Photo by Mehmet Turgut Kirkgoz via Pexels

Ask an operations researcher to solve any problem – be it optimizing your stock portfolio, scheduling your delivery routes, or fixing your marital problems – and they are likely to blurt out ‘linear programming’ as their go-to solution. A mathematical method conceived in the aftermath of WWII as a means to automate planning procedures at the US Army Air Forces, linear programming has since evolved into a mature field with widespread applications in transportation, manufacturing, finance, health care and many other domains. Typical implementations handle problems with thousands of decisions variables, countless constraints and a multitude of cost- and reward components. Such problems are not merely academic figments of imagination; actual real-world problems are modeled and solved this way. In fact, linear programming works so well in so many settings, it is sometimes surprising to see how little attention it receives in the reinforcement learning community.

Intro to Linear Programming

For those unfamiliar with the methodology, a very brief primer might be appropriate. A linear program is essentially a system of equations following a fixed structure, containing (i) an objective function that must be minimized or maximized, (ii) a set of constraints on the decision space and (iii) a definition of the decision variables with corresponding domain restrictions. As the method’s name suggests, this entire system should be linear – this powerful property allows to exploit to convexity of the problem structure and drastically improves the speed of the solution methods. Solving the system gives the optimal solution. The typical linear program recipe looks something like this:

👁 Image

Why bother with such formulations? Well, nobody better than founding father George Dantzig himself to illustrate the potency of linear programming [1]:

"Consider the problem of assigning 70 men to 70 jobs. An ‘activity’ consists of assigning the i-th man to the j-th job. […] Thus there are 2 70, or 140, restrictions and 70 70, or 4900, activities with 4900 corresponding zero-one decision variables. Unfortunately there are also 70 factorial permutations, or ways to make the assignments. […]

Now in this example 70 factorial is a very big number. To get some idea of how big, suppose we had had an IBM main-frame computer available at the time of the Big Bang fifteen million years ago. Would it between then and now have been able to examine all the possible solutions? No! […] Even if the Earth were filled with nanosecond speed computers, all working in parallel, the answer would still be no. […] The remarkable thing is that the simplex method with the aid of a modern computer can solve this problem in a split second."

Impressive, to say the least. This simplex method, developed in the late 1940s, has formed the basis for a number of highly advanced solvers such as CPLEX and Gurobi that are used nowadays. Combined with ongoing advances in hardware, the progress of the field has been staggering, tackling problems of ever-increasing magnitudes. Keep in mind linear programming is an exact solution method: it does not return simply a feasible or ‘good’ solution, but the best possible solution out of all these millions, billions or quadrillions of solutions, not seldom within seconds!

Application in Reinforcement Learning

Enough about linear programming for the moment – I don’t want to give the impression that I’m giving a marketing pitch or am trying to inflate my IBM stocks. Let’s turn our attention towards Reinforcement Learning now. For the sake of convenience, assume we try to (approximately) solve a Markov decision model. When trying to do so, we often run into the so-called three curses of dimensionality, referring to a state space, outcome space and action space all exploding when the size of the problem grows [2]. In the well-known Bellman equation, these computational problems are immediately visible:

👁 Image

The outcome space S’ refers to the set of future states that may be reached from the selected state-action pair. With each state having its own value attached, cycling through all of them for every feasible decision is a cumbersome task. Instead of evaluating all possible outcomes, we instead sample a single one and move forward. By doing this often enough, the law of large numbers predicts we should converge to the true value of the underlying random variable. For the state space S, instead of explicitly observing each state, we often define features (basically just explanatory variables) that distill the most salient properties predicting its value. As such, we don’t need to observe each individual state to assess how attractive it is; we can make a reasonable guess based on the features. These solutions are widely embedded in many reinforcement learning algorithms.

The dimensionality of the action space X receives comparatively less attention. Partially, this may be explained by the fact that full enumeration is possible for a lot of problems. Once we generate an action, we only need to compute the value of the state-action pair to gauge the allure of the action, so for modest action spaces enumeration works perfectly well. In a maze you can only move into four directions, and even Super Mario doesn’t have that many tricks up his sleeve. Things get more complicated on a chessboard or during a game of Go, but still nothing a modern computer can’t handle. Aside from enumeration, we of course have the powerhouse that is the policy gradient theorem, often deployed under the umbrella of actor-critic methods [3]. Instead of enumerating all feasible actions, we can cleverly sample them and adjust our sampling strategy along the way. Naturally we lose some information by not explicitly evaluating each action, but as long as we smoothly converge to the good action regions we often take this for granted.

Unfortunately, enumeration and actor-critic methods are not always there to save us. Let’s pick a simple example to illustrate. Consider a multi-modal transport hub, from which containers may be shipped by truck, barge or train. Suppose containers have 25 potential destinations, 10 different weight classes, and a due date ranging between 1 and 5. We then have 25105=1250 distinct container types, translating to a state vector with 1250 dimensions indicating the number of containers per type currently at the hub. Even without formal proof, it is easy to see the numbers of possible permutations is absolutely massive – dealing with these state- and outcome spaces head on is impossible. What we are interested in now, however, is the action space. Each container may be shipped by truck, barge or train, or left at the hub. If we have n containers at the hub, these 4 options per container give us 4^n feasible actions; exactly the type of exponential action space we’d rather not handle. To illustrate: with merely 10 containers we already face over one million feasible actions.

Enumeration can quickly be ruled out here. We don’t want to enumerate a million actions for every episode, and God-forbid we encounter 20 containers on a bad day. Actor-critic then? We could conceptualize various architectures for the actor network, but they all face the same challenges. The state- and action vectors may be massive, but they are also sparse. By far the most container types will have a value of 0 on a given day. Unfortunately, a neural network is not going to learn much by passing through a bunch of zeros. Sampling in high-dimensional spaces is a challenging task; most actors would get lost in the woods in this case.

Integrating Linear Programming and Reinforcement Learning

Despite facing a fairly clean and straightforward problem, we can’t enumerate and we can’t effectively apply actor-critic methods. Unsurprisingly, this is where linear programming re-enters the stage. Remember how this method conveniently could handle large, high-dimensional action problems? What a coincidence.

First of all, we need to transform our decision problem into the generic format that has been introduced earlier. In this case, we want to allocate our containers in a way that minimizes our transport costs. Knowing the cost functions of our modalities, we can comprehensively wrap this in an objective function. Most likely, not all allocations are allowed: there is a fixed number of trucks, the barge can only handle a certain weight, and obviously we cannot allocate more containers than we have… our set of constraints is already shaping up. The decision variables? A set of binaries seems to do the trick. Done. Naturally, not every problem allows itself to be captured in this rigid format so easily. Yet, it should also not be ruled out too easily. Academics and practitioners from many disciplines have had success in fitting highly complex problems into this format. With entire airline operations being controlled by linear programs, chances are your RL problem can also molded in the same shape.

👁 Linear optimization for a 2-dimensional polytope, by Ylloh via WikiMedia

Linear optimization for a 2-dimensional polytope, by Ylloh via WikiMedia

Right, so we have modeled our decision problem for today, but what about tomorrow? After all, the aspect of looking ahead is what makes reinforcement learning so interesting. For this, we need to add some artificial variables to the mix. These variables represent features that capture the downstream value of being in a given state. In this context, we would typically design these features manually, using linear expressions based on state and action to compute them.

Suppose – as one of our features – we keep track of the number of containers with a due date of 5. We have six of them now, we consider shipping four: we are left with two (i.e., ϕ_f(s,a)=ϕ_f(6,4)=6–4=2). Other examples could be a truck’s next location, the remaining cash budget after extending, etc. All these features are computed from state-action pairs; the corresponding weights θ_f are learned to attach value to the features.

For more depth on feature design in this context, the following articles on post-decision states might be relevant:

What Are Post-Decision States and What Do They Want From Us?

About Post-Decision States Again

The aforementioned equalities are easily added to the existing set of constraints. With the artificial variable ϕ embedded into the objective function, we only need to multiply it with the weight θ we learned to evaluate the downstream value of our actions [2].

👁 Image

All in all, linear programming is worth considering for anyone dealing with large numbers of decisions. Naturally, there are shortcomings, limitations and challenges not discussed in this article, but with a 70-year track record, this optimization method is far from a fad. It provides an incredible upscale compared to enumeration while preserving optimality in evaluating actions. When your laptop’s memory keeps filling up with endless arrays of enumerated actions, when your actor keeps leading you nowhere, linear programming might be just the boost your RL algorithm needs to restore order in those massive actions spaces. After all, as George Dantzig discovered decades ago: linear programs can run armies.

References

[1] An Interview with George B. Dantzig: The Father of Linear Programming – The College Mathematical Journal (1986). http://www.phpsimplex.com/en/Dantzig_interview.htm

[2] W.B. Powell (2011). Approximate Dynamic Programming, John Wiley and Sons, 2nd edition.

[3] R.S. Sutton & A.G Barto (2018). Reinforcement Learning: An Introduction, MIT Press, 2nd edition.

Written By

Wouter van Heeswijk, PhD

See all from Wouter van Heeswijk, PhD

Share This Article

Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

URL: https://towardsdatascience.com/using-linear-programming-to-boost-your-reinforcement-learning-algorithms-994977665902/