Teaching AI to Learn How Humans Plan Efficiently

Using reinforcement learning to build a cognitive model of hierarchical discovery

Aug 24, 2020

19 min read

Human planning is hierarchical. Whether planning something simple like cooking dinner or something complex like a trip abroad, we usually begin with a rough mental sketch of the goals we want to achieve ("go to India, then return back home"). This sketch is then progressively refined into a detailed sequence of sub-goals ("book flight ticket", "pack luggage"), sub-sub-goals, and so on, down to the actual sequence of bodily movements that is much more complicated than the original plan.

Efficient planning requires knowledge of the abstract high-level concepts that constitute the essence of hierarchical plans. Yet how humans learn such abstractions remains a mystery.

Here, we show that humans spontaneously form such high-level concepts in a way that allows them to plan efficiently given the tasks, rewards, and structure of their environment. We also show that this behavior is consistent with a formal model of the underlying computations, thus grounding these findings in established computational principles and relating them to previous studies of hierarchical planning.

👁 Example of hierarchical planning [6]

Example of hierarchical planning [6]

The figure above depicts an example of hierarchical planning, namely how someone might plan to get from their office in Cambridge to purchase a dream wedding dress in Patna, India. Circles represent states and arrows represent actions that transition between states. Each state represents a cluster of states in the lower level. Thicker arrows indicate transitions between higher-level states, which often come to mind first.

A Bayesian perspective

When applied to computationally intelligent agents, hierarchical planning could enable models with advanced planning abilities. Hierarchical representations can be modeled from a Bayesian viewpoint by assuming a generative process for the structure of a particular environment. Existing work on this problem includes the development of a computational framework for acquiring hierarchical representations under a set of simplified assumptions on the hierarchy, i.e. modeling how people create clusters of states in their mental representations of reward-free environments in order to facilitate planning.

In this work, we contribute a Bayesian cognitive model of hierarchical discovery that combines knowledge of clustering and rewards to predict cluster formation, and compare the model to data obtained from humans.

We analyze situations with both static and dynamic reward mechanisms, finding that humans generalize information about rewards to high-level clusters and use information about rewards to create clusters, and that reward generalization and reward-based cluster formation can be predicted by our proposed model.

Theoretical background

A key area where psychology and neuroscience combine is the formal understanding of human behavior in relation to assigned actions. We ask:

What is the planning and methodology employed by human agents when faced with accomplishing some task? How do humans discover useful abstractions?

This is especially interesting in light of the unique ability of humans and animals to adapt to new environments. Previous literature on animal learning suggests that this flexibility stems from a hierarchical representation of goals that allows for complex tasks to be broken up into low-level subroutines that can be extended across a variety of contexts.

Chunking

The process of chunking occurs when actions are stitched together into temporally extended action sequences that achieve distant goals. Chunking is often the result of the transfer of learning from a goal-directed system to a habitual system, which executes actions in a stereotyped way.

From a computational standpoint, such a hierarchical representation allows for agents to quickly execute actions in an open loop, reuse familiar action sequences whenever a known problem is encountered, learn faster by tweaking established action sequences to solve problems reminiscent of those seen previously, and plan over extended time horizons. Agents do not need to be concerned with the minuscule tasks associated with goal achievement, e.g., the goal of going to the store being broken down into leaving the house, walking, and entering the store as opposed to getting up out of bed, moving the left foot forward, then the right one, etc.

Hierarchical reinforcement learning

The question of how agents should make rewarding decisions is the subject of reinforcement learning. Hierarchical reinforcement learning (HRL) has become the prevailing framework for representing hierarchical learning and planning. Within research on modeling of HRL, several ideas have been presented around potential methods of model construction.

We focus on the idea that people spontaneously organize their environment into clusters of states that constrain planning. Such hierarchical planning is more efficient in time and memory than naive or flat planning, which include low-level actions and is consistent with people’s limited working memory capacity [3].

In the diagram below, thick nodes and edges indicate that they must all be considered and maintained in short-term memory in order to compute the plan, and gray arrows indicate cluster membership. We observe that planning how to get from state s to state g in the low-level graph G takes at least as many steps as actually executing the plan (top), introducing high-level graph H alleviates this problem reduces computational costs (middle), and extending the hierarchy recursive further reduces the time and memory involved in planning (bottom).

👁 Hierarchical representations reduce the computational costs of planning [6]

Hierarchical representations reduce the computational costs of planning [6]

Solway et al. provide a formal definition of an optimal hierarchy, but they do not specify how the brain might discover it [2]. We hypothesize that an optimal hierarchy depends on the structure of environment, including both graph structure and the distribution of observable features of the environment, specifically rewards.

Model

We assume that agents represent their environment as a graph, where nodes are states in the environment and edges are transitions between states. The states and transitions may be abstract or as concrete as subway stations and the train lines traveling between them.

Structure

We represent the observable environment as graph G = (V, E) and the latent hierarchy as H. Both G and H are unweighted, undirected graphs. H consists of clusters, where each low-level node in G belongs to exactly one cluster, and bridges, or high-level edges, that connect these clusters. Bridges can exist between clusters k and k’ only if there is an edge between some v, v’ ∈ V such that v ∈ k and v’∈ k’, i.e., each high-level edge in H has a corresponding low-level edge in G.

In the illustration below, colors denote cluster assignments. Black edges are considered during planning, while gray edges are ignored by the planner. Thick edges correspond to transitions across clusters. The transition between clusters w and z is accomplished via a bridge.

👁 Example high-level graph (top) and low-level graph (bottom) [6]

Example high-level graph (top) and low-level graph (bottom) [6]

Prior to the addition of rewards, the learning algorithm discovers optimal hierarchies given the following constraints:

Small clusters
Dense connectivity within clusters
Sparse connectivity across clusters

However, we do not want clusters to be too small – in the extreme, each node is its own cluster, which renders the hierarchy useless. Additionally, while we want sparse connectivity across clusters, we want to maintain bridges across clusters in order to preserve properties of the underlying graphs.

We use the discrete-time stochastic Chinese Restaurant Process (CRP) as a prior for clusters. The discovery of hierarchies can be accomplished by inverting the generative model to obtain the posterior probability of hierarchy H. The generative model formally presented in [6] generates such hierarchies.

Rewards

In the context of the graph G, rewards can be interpreted as observable features of vertices. Because people often cluster based on observable features, it is reasonable to model clusters induced by rewards [5]. Furthermore, we assume that each state delivers a randomly determined reward and that the agent’s goal is to maximize the total reward [8].

Since we hypothesize that clusters induce rewards, we model each cluster as having an average reward. Each node in that cluster has an average reward drawn from a distribution centered around the average cluster reward. Finally, each observed reward is drawn from a distribution centered around the average reward of that node. A formal discussion is provided in [1].

To simplify inference, we first assume that rewards are constant, static. We label rewards that can change between observations with some fixed probability as dynamic.

We conducted two experiments to test our hypothesis about human behavior and understand how well it could be predicted by our model. In particular, we studied to what degree clusters drive inferences about rewards, and to what degree rewards drive the formation of clusters. For each experiment, we collected human data and compared it to the predictions of the model.

Clusters induce rewards

The goal of the first experiment was to understand how rewards generalize within state clusters. We tested whether graph structure drives cluster formations and whether people generalize a reward observed at one node to the cluster that the node belongs to.

Setup

The experiment was conducted by asking 32 human subjects to choose a node to visit next as specified in the following scenario. Participants were randomly presented with either the graph below or a flipped version of it, to ensure that bias of handedness or graph structure was not introduced. We predicted that participants would choose the node adjacent to the labeled one that was located in the larger cluster, i.e. the gray node to the left of the blue one in the first case, and the gray node to the right of the blue one in the second case.

Participants were presented with the following task and associated graph:

You work in a large gold mine that is composed of multiple individual mines and tunnels. The layout of the mines is shown in the diagram below (each circle represents a mine, and each line represents a tunnel). You are paid daily, and are paid $10 per gram of gold you found that day. You dig in exactly one mine per day, and record the amount of gold (in grams) that mine yielded that day. Over the last few months, you have discovered that, on average, each mine yields about 15 grams of gold per day. Yesterday, you dug in the blue mine in the diagram below, and got 30 grams of gold. Which of the two shaded mines will you dig in today? Please circle the mine you choose.

👁 Graph of mines presented to participants [1]

Graph of mines presented to participants [1]

We expected most participants to automatically identify the following clusters, with nodes colored in peach and lavender to denote the different clusters, and make a decision about which mine to select with these clusters in mind. It was hypothesized that participants would select a peach-colored node as opposed to a lavender one, since the node with label 30, a fairly larger than average reward, is in the peach-colored cluster.

👁 Graph of mines presented to participants, with likely clusters shown [1]

Graph of mines presented to participants, with likely clusters shown [1]

Inference

We approximated Bayesian inference over H using Metropolis-within-Gibbs sampling [4], which updates each component of H by sampling from its posterior, conditioning on all other components in a single Metropolis-Hastings step. We employed a Gaussian random walk as the proposal distribution for continuous components, and the conditional CRP prior as the proposal distribution for cluster assignments [7]. The approach can be interpreted as stochastic hill climbing with respect to a utility function defined by the posterior.

Results

There were 32 participants in each of the human and simulated groups. The top three clusterings outputted by the model are shown below (left panel). All top three results were the same, indicating that the model identified the colored groupings with high confidence. The results for participants, as well as those for the static rewards model, are visualized in the bar chart below (right panel), depicting the proportion of human and simulated subjects who chose to visit node 2 next. The solid black line indicates the mean and the dotted black lines indicate the 2.5th and 97.5th percentiles.

👁 Results of the rewards generalization within clusters experiment [1]

Results of the rewards generalization within clusters experiment [1]

The listed p-values in the table below were calculated via a right-tailed binomial test, where the null was assumed to be a binomial distribution over choosing left or right gray node. The significance level was taken to be 0.05, and both the human experimental results and modeling results were statistically significant.

👁 Actions taken by humans and the static rewards model [1]

Actions taken by humans and the static rewards model [1]

Rewards induce clusters

In the second experiment, the goal was to determine whether rewards induce clusters. We predicted that nodes with the same reward positioned adjacent to each other would be clustered together, even if the structure of the graph alone would not induce clusters.

Recall that Solway et. al showed that people prefer paths that cross the fewest hierarchy boundaries [2]. Therefore, between two otherwise identical paths, the only reason to prefer one over the other would be because it crosses fewer hierarchy boundaries. One possible counterargument to this is that people pick the path with higher rewards. However, in our setup detailed below, rewards are given only in the goal state, not cumulatively over the path taken. Additionally, the magnitude of rewards was varied between trials. Therefore, it is unlikely that people would favor a path because nodes along that path had higher rewards.

Setup

This experiment was conducted on the web using Amazon Mechanical Turk (MTurk). Participants were given the following context about the task:

Imagine you are a miner working in a network of gold mines connected by tunnels. Every mine yields a certain amount of gold (points) each day. On each day, your job is to navigate from a starting mine to a target mine and collect the points from the target mine. On some days, you will be free to choose any mine you like. On those days, you should try to pick the mine that yields the most points. On other days, only one mine will be available. The points of that mine will be written in green and the other mines will be grayed out. On those days, you should navigate to the available mine. The points of each mine will be written on it. The current mine will be highlighted with a thick border. You can navigate between mines using the arrow keys (up, down, left, right). Once you reach the target mine, press the space key to collect the points and start the next day. There will be 100 days (trials) in the experiment.

The graph below (left panel) was presented to participants. As in the previous experiment, participants were randomly given either the configuration shown in or the horizontally-flipped version of the same graph in order to control for potential left-right asymmetry. Expected induced clusters are depicted as well, with nodes numbered for reference (right panel).

👁 Graph of mines presented to MTurk participants (left), with likely clusters shown (right) [1]

Graph of mines presented to MTurk participants (left), with likely clusters shown (right) [1]

We will refer to the first case, where participants are free to navigate to any mine, as free-choice and the second case, where participants navigate to a specified mine, as fixed-choice. Participants received a monetary reward for each trial to discourage random responses.

At each trial, reward values were changed with probability 0.2. New rewards were drawn uniformly at random from the interval [0, 300]. However, the grouping of rewards remained the same across trials: nodes 1, 2, and 3 always had one reward value, nodes 4, 5, and 6 had a different reward value, and nodes 7, 8, 9, and 10 had a third reward value.

The first 99 trials allowed the participant to develop a hierarchy of clusters. The final trial, which acted as the test trial, asked participants to navigate from node 6 to node 1. Assuming that rewards induced the clusters shown in above, we predicted that more participants would take the path through node 5, which crosses only one cluster boundary, over the one through node 7, which crosses two cluster boundaries.

Inference

We modeled the fixed-choice case, with the assumption that the tasks in all 100 trials were all the same as the 100th trial presented to participants, the test trial. First, we assumed static rewards, where the rewards remained constant across all trials. Next, we assumed dynamic rewards, where rewards changed for each trial.

In contrast to the previous experiment, where the participant picks a single node the model predicts that node, this experiment is concerned with the second node of the full path the participant chose to take from the start node to the goal node. Therefore, in order to compare the model to human data, we used a variant of breadth-first search, hereafter referred to as hierarchical BFS, to predict a path from the start node (node 6) to the goal (node 1).

Static rewards. For each subject, we sampled from the posterior using Metropolis-within-Gibbs sampling and chose the most probable hierarchy, i.e., the hierarchy with the highest posterior probability. Then, we used hierarchical BFS to first find a path between clusters and then between the nodes within the clusters.

Dynamic rewards. For dynamic rewards, we used online inference. For each simulated participant, we allowed the sampling for each trial to progress only 10 steps. Then, we saved the hierarchy, and added information about the modified rewards. Next, we allowed sampling to progress again, starting from the saved hierarchy. As in the human experiment, at the beginning of each trial, there was a 0.2 probability that the rewards would be re-randomized to new values, although the rewards were always equal within clusters. This inference method simulated how human participants might learn cumulatively over the course of many trials. We assumed, for the purpose of this experiment, that people keep only one hierarchy in mind at a time, rather than updating multiple hierarchies in parallel. We also modified the log posterior to penalize disconnected clusters, because such clusters became much more common under this type of inference.

Results

There were 95 participants in each of the human and two simulated groups. The null hypothesis is represented by an equal number of participants choosing a path through node 5 and through node 7, since in the absence of any other information and given that both paths are of equal length, a participant is equally likely to choose either.

👁 Actions taken by humans and the static and dynamic rewards models [1]

Actions taken by humans and the static and dynamic rewards models [1]

As given in the table above, the results of the human experiment and static rewards modeling were statistically significant at α = 0.05. Furthermore, as shown below, the results of the human experiment are in the 90th percentile of a normal distribution centered around 0.5, the expected proportion given the null hypothesis. In the figure, we include clusterings identified by the static rewards model (first row), the static rewards model with cluster formation between disconnected components penalized second row), and the dynamic rewards model (third row).

👁 Clusters identified by simulations [1]

Clusters identified by simulations [1]

Static rewards. We used 1000 iterations of Metropolis-within-Gibbs sampling to generate each sample, with a burnin and lag of 1 each. The simulation under static rewards certainly favors paths through node 5, to a level that is statistically significant. Moreover, since its purpose is to model human behavior, this result is meaningful in light of the human data being statistically significant as well (0.0321 < α = 0.05).

👁 Human and simulated subjects' choices [1]

Human and simulated subjects’ choices [1]

Dynamic rewards. In order to mimic the human trials, we ran 100 trials, each with 10 iterations of Metropolis-within-Gibbs to sample from the posterior. The burnin and lag were again set to 1. The online inference method appears to have modeled human data better than modeling for static rewards, even though the group of simulated participants under dynamic rewards modeling is farther from the hypothesis than the group simulated under static rewards modeling. 56 human participants and 54 simulated participants under dynamic rewards modeling chose to go through node 5 (a 3.4% difference), compared to 64 simulated participants under static rewards modeling (an 18.5% difference).

The bar chart above visualizes the proportion of human and simulated subjects whose chosen path’s second node was node 5. The solid black line indicates the expected proportion given the null hypothesis and the dotted black lines indicate the 10th and 90th percentiles.

Conclusions

Humans seem to spontaneously organize environments into clusters of states that support hierarchical planning, enabling them to tackle challenging problems by breaking them down into sub-problems at various levels of abstraction. People constantly rely on such hierarchical presentations to accomplish tasks big and small – from planning one’s day, to organizing a wedding, to getting a PhD – often succeeding on the very first attempt.

We have shown that an optimal hierarchy depends not only on graph structure, but also on observable characteristics of the environment, i.e., the distribution of rewards.

We built hierarchical Bayesian models to understand how clusters induce static rewards and how both static and dynamic rewards induce clusters, and found that most results were statistically significant in terms of how closely our models captured human actions. All data and code files for all the simulations and experiments presented are available in the GitHub repository linked here. We hope that the model presented, as well as related results in a recent paper, pave the way for future studies to investigate the neural algorithms that support the essential cognitive ability of hierarchical planning.

References

[1] A. Kumar and S. Yagati, Reward Generalization and Reward-Based Hierarchy Discovery for Planning (2018), MIT

[2] A. Solway, C. Diuk, N. Córdova, D. Yee, A. Barto, Y. Niv, and M. Botvinick, Optimal behavioral hierarchy (2014), PLOS Computational Biology

[3] G. Miller, The magic number seven plus or minus two: Some limits on our capacity for processing information (1956), The Psychological Review

[4] G. Roberts and J. Rosenthal, Examples of Adaptive MCMC (2009), Journal of Computational and Graphical Statistics

[5] J. Balaguer, H. Spiers, D. Hassabis, and C. Summerfield, Neural mechanisms of hierarchical planning in a virtual subway network (2016), Neuron

[6] M. Tomov, S. Yagati, A. Kumar, W. Yang, and S. Gershman, Discovery of hierarchical representations for efficient planning (2020), PLOS Computational Biology

[7] R. Neal, Markov Chain Sampling Methods for Dirichlet Process Mixture Models (2000), Journal of Computational and Graphical Statistics

[8] R. Sutton and A. Barto, Reinforcement Learning: An Introduction (2018), The MIT Press

Written By

Agni Kumar

See all from Agni Kumar

AI, Data Science, Editor’s Picks, Machine Learning, Reinforcement Learning

Share This Article

Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

URL: https://towardsdatascience.com/teaching-ai-to-learn-how-humans-plan-efficiently-1d031c8727b/