Thompson Sampling

Multi-Armed Bandits: Part 5

Steve Roberts

Nov 2, 2020

21 min read

A Baby Robot’s Guide To Reinforcement Learning

👁 Photo by National Cancer Institute on Unsplash

Photo by National Cancer Institute on Unsplash

👁 Image

Overview

Over the last few parts in this series we’ve been looking at increasingly complex methods of solving the Multi-Armed Bandit problem. We’ve now reached the final and most complex of all the methods we’re going to look at: Thompson Sampling.

If you’re not already familiar with the bandit problem and its terminology you may want to first take a look at the earlier parts of this series, which are as follows:

Part 1: Mathematical Framework and Terminology– all the basic information needed to get started
Part 2: The Bandit Framework – a description of the code and test framework
Part 3: Bandit Algorithms _- The Greedy Algorithm
The Optimistic-Greedy Algorithm
The Epsilon-Greedy Algorithm (ε-Greedy)
Regret_
Part 4: The Upper Confidence Bound (UCB) Bandit Algorithm

All code for the bandit algorithms and testing framework can be found on github: Multi_Armed_Bandits

Recap

Baby Robot is lost in the mall. Using Reinforcement Learning we want to help him find his way back to his mum. However, before he can even begin looking for her, he needs to recharge, from a set of power sockets that each give a slightly different amount of charge.

Using the strategies from the multi-armed bandit problem we need to find the best socket, in the shortest amount of time, to allow Baby Robot to get charged up and on his way.

👁 Image

Baby Robot has entered a charging room containing 5 different power sockets. Each of these sockets returns a slightly different amount of charge. We want to get Baby Robot charged up in the minimum amount of time, so we need to locate the best socket and then use it until charging is complete.

This is identical to the Multi-Armed Bandit problem except that, instead of looking for a slot machine that gives the best payout, we’re looking for a power socket that gives the most charge.

👁 Image

Up until now, all of the methods we’ve seen for tackling the Bandit Problem have selected their actions based on the current averages of the rewards received from those actions. Thompson Sampling (also sometimes referred to as the Bayesian Bandits algorithm) takes a slightly different approach; rather than just refining an estimate of the mean reward it extends this, to instead build up a probability model from the obtained rewards, and then samples from this to choose an action.

In this way, not only is an increasingly accurate estimate of the possible reward obtained, but the model also provides a level of confidence in this reward, and this confidence increases as more samples are collected. This process of updating your beliefs as more evidence becomes available is known as Bayesian Inference.

Bernoulli Thompson Sampling

As an introduction, and to make things easier to work with, let’s simplify the power socket problem. Now, instead of each socket returning a varying amount of charge, each socket will either return some charge or no charge; the rewards have only two possible values: 1 when the chosen socket supplies a charge and 0 when it doesn’t. When a random variable has only two possible outcomes its behaviour can be described by the _Bernoulli distribution_.

So now, instead of the amount of charge varying per socket, the probability of a socket producing a charge varies with each socket. We want to find the socket with the highest probability of returning a charge, rather than the socket that gives the most charge.

As already mentioned, Thompson Sampling generates a model of the reward probabilities. When, as in this case, the available rewards are binary (win or lose, yes or no, charge or no charge) then the _Beta distribution_ is ideal to model this type of probability.

(For more information on the relationship between the Beta and Bernoulli distributions check out this excellent article: Beta Distribution – Intuition, Examples, and Derivation).

The Beta distribution takes two parameters, ‘α‘ (alpha) and ‘β‘ (beta). In the simplest terms these parameters can be thought of as respectively the count of successes and failures.

Additionally, a Beta distribution has a mean value given by:

👁 Image

👁 Figure 5.1: The Beta distribution for various values of alpha and beta.

Figure 5.1: The Beta distribution for various values of alpha and beta.

Initially we have no idea what the probability is of any given socket producing an output, so we can start by setting both ‘α‘ and ‘β‘ to one, which produces a flat line _Uniform distribution_ (shown as the flat, red, line in figure 5.1).

This initial guess at the probability of the socket producing an output is known as the _Prior Probability_; it is the probability of the specific event occurring before we have collected any evidence and in this case is represented by the Beta distribution Beta(1,1).

Once we test a socket, and obtain a reward, we can modify our belief in the likelihood of that socket returning some charge. This new probability, after some evidence has been collected, is known as the _Posterior Probability_. Again this is given by a Beta distribution, but now the values of ‘α‘ and ‘β‘ are updated with the value of the returned reward.

So, if a socket returns some charge, the reward will be 1 and ‘α‘, the count of the number of successes, will increment by 1. The count of the number of failures, ‘β‘, will not increase. If instead no reward was obtained, then ‘α‘ will stay the same and ‘β‘ will increment by 1. As more data is collected the Beta distribution moves from being a flat line to become an increasingly accurate model of the probability of the mean reward. By maintaining the values of ‘α‘ and ‘β‘ a Thompson sampling algorithm is able to describe the estimated mean reward and the level of confidence in this estimate.

In contrast to the Greedy algorithm, which at each time step selects the action with the highest estimated reward, even if the confidence in that estimate is low, Thompson sampling instead samples from the Beta distribution of each action and chooses the action with the highest returned value. Since actions that have been tried infrequently have wide distributions (see the blue curve in figure 5.1), they have a larger range of possible values. In this way, a socket that currently has a low estimated mean reward, but has been tested fewer times than a socket with a higher estimated mean, can return a larger sample value and therefore become the selected socket at this time step.

In the graph above, the blue curve has a lower estimated mean reward than the green curve. Therefore, under Greedy selection, green would be chosen and the blue socket would never be selected. In contrast, Thompson Sampling effectively considers the full width of the curve, which for the blue socket can be seen to extend beyond that of the green socket. In this case the blue socket may be selected in preference to the green one.

As the number of trials of a socket increases the confidence in the estimated mean increases. This is reflected in the probability distribution becoming narrower and the sampled value will then be drawn from a range of values that are closer to the true mean (see the green curve in figure 5.1). As a result, exploration decreases and exploitation increases, since the sockets with a higher probability of returning a reward will begin to be selected with increasing frequency.

On the other hand, sockets with a low estimated mean will start to be selected less frequently and will tend to be dropped early from the selection process. Consequently, their true mean may never be found. Since we are only interested in finding the socket with the highest probability of returning a reward, and finding it as quickly as possible, we don’t care if full information of poorly performing sockets is never obtained.

Bernoulli Socket Implementation

As in the socket experiments we carried out in previous parts of this series, we will be using a basic socket class, on top of which we add the specific functionality for the algorithm being studied. Then, using this new class, we run it through a set of experiments using the same test harness for all bandit algorithms. Full details of the power socket base class and the accompanying test system are given in Part 2 of this series and all the code can be found on github.

The implementation of Bernoulli Thompson sampling, as described above, is shown in the BernoulliThompsonSocket class:

In this class we initialise ‘α‘ and ‘β‘ to one, to give the Uniform Distribution. Then, when updating, we simply increment ‘α‘ if the socket returned a reward, otherwise we update ‘β‘.

The "sample" function draws a value from the Beta distribution, using the current values of ‘α‘ and ‘β‘ as its parameters.

Experiment Results

The evolution of the Beta distribution for each power socket, where we are using the simpler probabilistic power sockets, can be seen in Figure 5.2 below.

To keep things simple, we’ve reduced the number of sockets to three and these have true probabilities 0.3 (green), 0.7(red) and 0.8 (blue) of returning some power when tested.

👁 Figure 5.2: Thompson sampling using Beta distributions for 3 probabilistic power sockets, with true probabilities of 0.3, 0.7 and 0.8.

Figure 5.2: Thompson sampling using Beta distributions for 3 probabilistic power sockets, with true probabilities of 0.3, 0.7 and 0.8.

In Figure 5.2 above, the true means of 0.3, 0.7 and 0.8 are shown by the dashed lines. The legend displays the number of trials for each socket and the number of successes that have resulted from these trials.

The main points to note from Figure 5.2 are the following:

At time step 0 (not shown) all Beta distributions will have their α and β values set to one, to give a flat Uniform distribution.
Since all sockets have the same initial distribution, at time step 1 the blue power socket is selected arbitrarily. When tested it gives a reward, so its α value gets incremented by 1 and its probability density curve shifts to the right. The green and red sockets have not yet been tested, so retain their initial flat distribution (with the green curve being hidden behind the red curve).
At the second time step the blue socket is again selected and again it returns a reward. The blue curve squeezes slightly more to the right, since this socket has been tested twice and has returned a reward both times, the possibility still exists that this socket will return a reward every time it is selected.
By the 5th trial the blue socket has been selected once more, but this time it failed to give a reward. As a result the probability that it always returns a value drops to zero (at probability = 1.0). On the other hand, the green socket has now been tested twice and is yet to return a value, hence its probability density curve is shifted to the left with its highest value at probability =0, as there’s still a chance this socket never returns a reward.
At 15 trials the red socket has now been tried a couple of times. Since it’s returned a reward once, it has an estimated mean reward probability of 0.5. At this stage the blue socket has been tried 11 times and has returned a reward on 6 of these trials, giving it a slightly higher estimated reward probability of 0.54. In a Greedy system the blue socket would therefore be the chosen socket, however because the red socket has been tried less times than the blue socket, it can be seen to have a much wider probability density curve, giving it a good chance of being selected in preference to the blue socket.
The more times a socket is tested, the more confident we are in its estimate and the narrower its probability density curve becomes. The best socket will then be used more often and testing of the sub-optimal sockets will tail off. This behaviour can be seen at the end of our test, when the blue socket has been tried much more often than either of the two other sockets. It can also be seen that the green socket did finally return a reward on 2 of its trials. Consequently, it’s no longer possible that this is a socket that never returns a reward, so the probability of this happening (returning a reward with probability = 0) drops to zero.
It can also be seen that neither the blue nor red sockets have probability density curves centred exactly on their true mean probabilities. If the test was run for more time steps then the blue curve would eventually settle on the true mean value, but the red socket will have a greatly reduced number of trials (if any) and so may never come to find its true value. As mentioned, this isn’t an issue, since we’re only interested in finding the best socket, not the true mean values of the other sockets.

Gaussian Thompson Sampling

The simplified socket problem we’ve used so far is a good way to grasp the concepts of Bayesian Thompson Sampling. However, to use this method with our actual socket problem, in which the sockets aren’t binary, but instead return a variable amount of charge, we need to change things slightly.

In the previous problem we modelled the socket’s behaviour using a Beta distribution. This was chosen because the simplified socket output had only two possible outcomes, some charge or no charge, and could therefore be described using a Bernoulli distribution. When a value drawn from a Bernoulli distribution (the likelihood value) is multiplied by a value drawn from a Beta distribution (the prior probability), then the resultant value (the posterior probability) also has a Beta distribution. When this occurs, such that the likelihood multiplied by the prior results in a posterior with the same distribution type as the prior, the prior is referred to as a _Conjugate Prior_.

With our standard socket problem, each socket returns a real value described by a normal distribution. If we assume we know the variance of our socket (which is actually 1, since we use an unmodified version of the numpy randn function in our code), then from the _table of conjugate priors on Wikipedia_ we can see that the conjugate prior also has a normal distribution. If we don’t know the variance of our distribution, or we’re using a different distribution, then we simply need to pick one of other conjugate priors from the table and adjust our algorithm accordingly.

So we can model the output of a socket using a normal distribution and gradually refine this model by updating its mean and variance parameters. If, instead of using the variance, we use the precision ‘τ’ (tau), where precision is just one over the variance (precision τ = 1/variance), then we can use the simple update rules for the mean ‘μ₀‘ and total precision ‘τ₀‘ given by:

👁 Image

where;

‘τ‘ is the precision of the actual socket output, which in our case is just 1.
‘n‘ is the number of times the socket has been tested.
‘xᵢ‘ is the output received at each test ‘i’ of this socket (equivalent to the reward ‘Rᵢ‘ that we’ve used up till now).
‘μ₀‘ is the estimated mean (the mean of the distribution used to model the output).
‘τ₀‘ is the total precision of the distribution used to model the output.

At a first glance this looks rather intimidating, but all it’s basically saying is that we have 2 parameters ‘μ₀‘ and ‘τ₀‘ that we’ll update each time we test a socket, just as we did with ‘α‘ and ‘β‘ for the Bernoulli socket. Except in that case those parameters represented the number of successes and failures of the socket, whereas ‘μ₀‘ and ‘τ₀‘ represent the estimated mean and the precision, representing the confidence in the estimated mean.

Additionally we can make a couple of other simplifications:

We know that the amount of charge returned from a socket has a variance of 1 and so the precision ‘τ‘ is also 1. Therefore the update of the socket precision is simply τ₀ = τ₀ + n, where n is the number of times the socket has been tested, so each time a socket gets tested we simply increment its precision by 1.
For the estimated mean the numerator contains a sum over all of the outputs produced by the socket, multiplied by ‘τ‘. As we saw way back in the Sample Average Estimates section of Part 1, it’s not a good idea to retain the sum of rewards, since this could potentially grow to an unmanageable size. However, in our base socket implement we always calculate ‘Qₜ(a)’, the estimated value of action ‘a‘ at time step ‘t’, which is given by:

👁 Image

In this equation ‘Rᵢ‘ is the reward obtained at each of the time steps when action ‘a‘ was taken and is identical to the term ‘xᵢ‘ used in the update equation above. Therefore we can simply replace the summation in the update equation with ‘nQₜ(a)’ and end up with the following simplified update equation:

👁 Image

With these simplifications we tame the scary mathematics! It’s now clear that all we need to do is to keep estimates of the mean and precision of the reward from each socket and then use 2 simple rules to update these values. Things become even clearer when these equations are translated into code.

Gaussian Socket Implementation

The associated code for a Gaussian Thompson sampling socket is shown below. This retains all of the basic functionality we’ve used in previous socket types and adds the parameters and update function for the posterior distribution that is used to model the socket output.

Note that in this update function we’ve replaced the summation over all of the observed rewards with ‘self.n * self.Q‘. This gives us exactly the same value without having to retain the sum of rewards which, as described previously, could potentially grow to an unmanageable size.

The two update functions, given in the equations above, translate into the simple lines of code shown in the update function.

The other main points to note are:

The ‘sample‘ function, instead of simply returning ‘Q‘, the estimate of the socket’s reward value, instead returns a value sampled from the normal distribution that we’re using to model the socket output (the posterior, with mean ‘_self.μ0‘ and precision ‘_self.τ0‘).
As with the Beta distribution, used to model the socket output for binary outputs, we want our prior distribution to start with a distribution that’s close to the uniform distribution, giving a flat probability distribution curve and therefore the potential to return a sampled value with a wide range of possible values. So, initially we set the precision of the posterior to be a very small value (_self.τ0 = 0.0001). In this way sockets that have not yet been tried will be more likely to be selected, much like the Optimistic-Greedy algorithm.

If you look back at the code for the base power socket, when a socket is sampled it returns an amount of charge given by a normal distribution around its true mean value:

In the charge function above, The numpy "randn" function returns a random value from a normal distribution of mean 0 and variance 1. By adding the true socket reward value ‘q‘ to this, we shift the mean to get the distribution to be centred on the actual output of the socket.

In the sample function of the Thompson socket a very similar function can be seen:

Except in this case the normal is centred on _self.μ0, the posterior mean. Additionally, it can be seen that randn is now divided by the square root of the posterior precision _self.τ0. Remember that the precision is just one over the variance and that variance is the standard deviation squared. Therefore dividing by the square root of the precision is identical to multiplying by the standard deviation. This is what changes the width of the distribution, reducing it as more samples occur and we become more confident in our estimated mean.

Experiment Results

As with the Bernoulli experiment done previously, we’ve observed the socket selection over 1000 trials, as shown by the probability density curves below. Due to the posterior distributions starting out with almost flat curves, each of the 5 sockets gets tested once during the first 5 trials. After this, socket 4 (shown as the red curve) dominates the further trials. By the end of the tests it has a tall, thin, curve centred on a value of 12 (the true socket reward value) indicating a high level of confidence in this value.

The only other socket that is tested more than once, over the first 200 tests, is socket 5 (the purple curve, which has a true socket reward of 10). However, it is only tested 3 times and therefore has a small, fat, distribution curve, indicating a low confidence in its value.

👁 Figure 5.3: Thompson sampling using normal distributions for 5 power sockets, with true reward values of 6, 4, 8, 12 and 10 respectively.

Figure 5.3: Thompson sampling using normal distributions for 5 power sockets, with true reward values of 6, 4, 8, 12 and 10 respectively.

From Figure 5.3 its clear to see how Thompson sampling quickly locates and then exploits the best socket, with the other sockets being left largely untested. In this way the algorithm manages to return a large, and nearly optimal, accumulated reward.

Regret

The regret obtained when using the Thompson Sampling algorithm with our standard socket selection problem is shown in Figure 5.4 below.

As was seen for the UCB algorithm, the regret is practically zero, meaning that the best socket was nearly always chosen. This can also be seen in the plot of Cumulative Reward vs Time, in which the actual obtained reward is such a close match for the optimal that it obscured by this curve on the graph. As was seen with the probability density curves for Gaussian Thompson Sampling, the algorithm quickly locks onto the best action and then ruthlessly exploits this, resulting in a very low level of regret.

As with the UCB algorithm, Thompson Sampling can be shown to have logarithmic regret, where the value of the regret falls to almost zero as time progresses.

(Due to the very small number of actions, and the distinct reward values of each socket, this logarithmic decline in the regret isn’t seen in our experiment, since we’re already down near to zero regret.)

👁 Figure 5.4: Thompson sampling regret for the standard socket selection problem.

Figure 5.4: Thompson sampling regret for the standard socket selection problem.

For a more in-depth look at Thompson Sampling, its uses and mathematical framework check out the following:

"[A Tutorial on Thompson Sampling](http://A Tutorial on Thompson Sampling Daniel J. Russo1)" Daniel J. Russo, et al.

Summary

Algorithms that solve the bandit problem need to find a way to balance the trade-off between exploitation and exploration. They need to look for the best actions to take while at the same time trying to make use of the information they’ve already gained.

In simple approaches, such as Epsilon-Greedy, this trade-off is achieved by mainly using the action that currently gives the most reward and adding simple exploration by now and again randomly trying some of the other actions. In more complex solutions, such as UCB, again the actions with the highest mean reward are selected most often but this is balanced by a confidence measure. This ensures that actions that have not been selected often will get tested.

Thompson Sampling takes a different approach to these other methods. Instead of simply maintaining an estimate of the reward, it gradually refines a model of the probability of the reward for each action and actions are chosen by sampling from this distribution. It is therefore possible to get an estimate for the mean reward value of an action, plus a measure of confidence for that estimate. As we saw in our experiments, this allows it to quickly locate and lock onto the optimal action, to give a near optimal accumulated return.

But is Thompson Sampling the best bandit algorithm and, more importantly, is it the one we should use to charge Baby Robot? To know the answers to these questions you’ll need to wait until the final part of this series, when we have the bandit algorithms go head-to-head in a final showdown!

👁 Image

< Part 4: Part 6 >
UCB Bandit Algorithm A Comparison of Bandit Algorithms

Supplement

We’ve looked at the theory behind Thompson Sampling and investigated how it can be used on a couple of simple problems. To see how this can be extended further, to work when the data comes from a normal distribution with unknown mean and variance, check out the article below…

Thompson Sampling using Conjugate Priors

Written By

Steve Roberts

See all from Steve Roberts

Baby Robot Guide, Multi Armed Bandit, Reinforcement Learning, Thompson Sampling

Share This Article

Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

URL: https://towardsdatascience.com/thompson-sampling-fc28817eacb8/