Philip Jama

Articles / Decision Science / Part 4

Multi-Armed Bandits and Thompson Sampling

From fixed-horizon tests to adaptive allocation and regret minimization

Decision Science · Bandits · Thompson Sampling · Bayesian · Python

A/B testing allocates traffic equally between variants and waits for a fixed horizon. This is simple, but it has a cost: every user assigned to the worse variant is a missed opportunity. Multi-armed bandits reframe the problem. Instead of testing first and deploying later, bandits learn and earn simultaneously by shifting traffic toward the better-performing arm as evidence accumulates.

The name comes from a row of slot machines ("one-armed bandits") in a casino. Each machine pays out at an unknown rate. The gambler must decide which machines to play and how often, balancing exploration (trying new machines to learn their rates) against exploitation (playing the machine that looks best so far). This explore-exploit tension appears everywhere: ad placement, recommendation systems, clinical trials, and feature rollouts.

The Explore-Exploit Tradeoff

Pure exploration (uniform random allocation) learns quickly but wastes traffic on bad arms. Pure exploitation (always pick the current leader) converges fast but risks locking onto a suboptimal arm before gathering enough evidence. Every bandit algorithm navigates this tradeoff differently.

The key insight: the optimal balance depends on how much uncertainty remains. Early in the experiment, when posteriors are wide, exploration is cheap because you are almost as likely to gain information from any arm. Late in the experiment, when one arm clearly dominates, continued exploration wastes resources.

Thompson Sampling

Thompson sampling is a Bayesian bandit algorithm with a remarkably simple rule: on each round, sample a value from each arm's posterior distribution, then play the arm whose sample is highest. Arms with high uncertainty get explored because their samples occasionally land above the current leader. Arms with low expected reward get abandoned because their samples rarely win.

For binary outcomes (click/no-click, convert/not), each arm's posterior is a Beta distribution, exactly as in Part 1 (Online Experiments with a Bayesian Lens). Start with Beta(1, 1) priors, observe successes and failures, and update. Thompson sampling turns those posteriors directly into an allocation policy.

The algorithm in pseudocode:

  1. For each arm k, maintain posterior Beta(a_k, b_k).
  2. At each round, draw theta_k ~ Beta(a_k, b_k) for every arm.
  3. Play the arm with the largest theta_k.
  4. Observe the outcome and update the chosen arm's parameters.
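The four steps above can be sketched as a minimal self-contained loop. The two true rates here are hypothetical, chosen only to illustrate the mechanics:

```python
import random

def thompson_step(alphas, betas, true_rates, rng=random):
    """One Thompson sampling round: sample from each Beta posterior,
    play the arm with the largest sample, then update its parameters."""
    samples = [rng.betavariate(a, b) for a, b in zip(alphas, betas)]
    arm = samples.index(max(samples))
    reward = 1 if rng.random() < true_rates[arm] else 0
    alphas[arm] += reward
    betas[arm] += 1 - reward
    return arm, reward

random.seed(0)
alphas, betas = [1, 1], [1, 1]  # Beta(1, 1) priors for two arms
for _ in range(5000):
    thompson_step(alphas, betas, [0.05, 0.12])

# Pulls per arm: successes + failures, minus the prior's 2 pseudo-counts
pulls = [a + b - 2 for a, b in zip(alphas, betas)]
```

With a 0.05 vs. 0.12 gap, the bulk of the 5,000 pulls ends up on the better arm as its posterior pulls ahead.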
Figure: Thompson sampling allocation converging toward the best arm as posteriors narrow.

Python source:
import math
import random
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

random.seed(42)

FT_BG = '#FFF1E5'
FT_CLARET = '#990F3D'
FT_OXFORD = '#0F5499'
FT_TEAL = '#0D7680'

plt.rcParams.update({
    'figure.facecolor': FT_BG,
    'axes.facecolor': FT_BG,
    'savefig.facecolor': FT_BG,
    'font.family': 'sans-serif',
    'font.sans-serif': ['Helvetica Neue', 'Arial', 'sans-serif'],
    'axes.spines.top': False,
    'axes.spines.right': False,
})

def beta_pdf(x, a, b):
    x = max(min(x, 1 - 1e-7), 1e-7)
    lg = math.lgamma
    log_pdf = (a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - (
        lg(a) + lg(b) - lg(a + b))
    return math.exp(log_pdf)

true_rates = [0.05, 0.08, 0.12]
arm_names = ['Arm 1 (p=0.05)', 'Arm 2 (p=0.08)', 'Arm 3 (p=0.12)']
arm_colors = [FT_OXFORD, FT_TEAL, FT_CLARET]
K = len(true_rates)
T = 2000

alphas = [1] * K
betas = [1] * K
counts = [0] * K
cumulative_frac = [[] for _ in range(K)]
snapshot_rounds = [100, 500, 2000]
snapshots = {}

for t in range(1, T + 1):
    samples = [random.betavariate(alphas[k], betas[k]) for k in range(K)]
    arm = samples.index(max(samples))
    reward = 1 if random.random() < true_rates[arm] else 0
    alphas[arm] += reward
    betas[arm] += (1 - reward)
    counts[arm] += 1
    for k in range(K):
        cumulative_frac[k].append(counts[k] / t)
    if t in snapshot_rounds:
        snapshots[t] = [(alphas[k], betas[k]) for k in range(K)]

fig = plt.figure(figsize=(11, 8))
gs = gridspec.GridSpec(2, 3, height_ratios=[1.2, 1], hspace=0.35, wspace=0.3)

ax_top = fig.add_subplot(gs[0, :])
rounds = list(range(1, T + 1))
for k in range(K):
    ax_top.plot(rounds, cumulative_frac[k], color=arm_colors[k], linewidth=2,
                label=arm_names[k], alpha=0.85)
ax_top.set_xlim(1, T)
ax_top.set_ylim(0, 1)
ax_top.set_xlabel('Round', fontsize=10, color='#333333')
ax_top.set_ylabel('Cumulative allocation fraction', fontsize=10, color='#333333')
ax_top.legend(fontsize=9, loc='center right', framealpha=0.9)

for i, t_snap in enumerate(snapshot_rounds):
    ax = fig.add_subplot(gs[1, i])
    xs = [j / 500 for j in range(1, 500)]
    for k in range(K):
        a, b = snapshots[t_snap][k]
        ys = [beta_pdf(x, a, b) for x in xs]
        ax.fill_between(xs, ys, color=arm_colors[k], alpha=0.2)
        ax.plot(xs, ys, color=arm_colors[k], linewidth=1.8, alpha=0.85)
    ax.set_xlim(0, 0.3)
    ax.set_xlabel('p', fontsize=9, color='#333333')
    if i == 0:
        ax.set_ylabel('density', fontsize=9, color='#333333')
    ax.set_title(f'Round {t_snap}', fontsize=10, color='#333333')

fig.text(0.5, 0.97, 'Thompson Sampling: Allocation and Posterior Evolution',
         ha='center', fontsize=14, fontweight='bold', color='#333333')
fig.text(0.5, 0.935, 'Top: cumulative arm allocation; Bottom: Beta posteriors at rounds 100, 500, 2000',
         ha='center', fontsize=10, color='#666666')
fig.text(0.02, 0.01, 'Source: Philip Jama via pjama.github.io',
         fontsize=8, color='#999999', ha='left')
fig.subplots_adjust(left=0.08, right=0.95, top=0.90, bottom=0.08, hspace=0.35, wspace=0.3)
fig.savefig('thompson_sampling.png', dpi=150, bbox_inches='tight')

print('wrote thompson_sampling.png')

Regret: Measuring Bandit Performance

Cumulative regret is the total reward lost by not always playing the best arm. If the best arm has true rate p* and you play arm k at round t, the per-round regret is p* - p_k. Cumulative regret sums these losses over all rounds.
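As a concrete check of the definition, with hypothetical arm rates and an arbitrary five-round play sequence:

```python
import numpy as np

true_rates = np.array([0.10, 0.15])      # hypothetical two-arm example
p_star = true_rates.max()                # best achievable rate, p* = 0.15
arms_played = np.array([0, 1, 1, 0, 1])  # arm chosen at each of five rounds
per_round = p_star - true_rates[arms_played]  # 0.05 when the worse arm plays
cumulative = np.cumsum(per_round)
# Two pulls of the worse arm cost 0.05 each: cumulative regret 0.10
```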

A good bandit algorithm has sublinear regret: per-round regret shrinks toward zero as the algorithm learns. Thompson sampling achieves logarithmic regret, matching the Lai-Robbins theoretical lower bound up to constant factors. In contrast, a fixed-horizon A/B test with equal allocation accumulates regret linearly for its entire duration, because its exploration never tapers off.

Regret Curves: Bandits vs. A/B Test

The following simulation compares cumulative regret across four strategies: uniform random allocation (the A/B test baseline), epsilon-greedy (explore with fixed probability), UCB1 (upper confidence bound), and Thompson sampling.

Figure: Cumulative regret curves showing Thompson sampling and UCB1 outperforming uniform allocation.

Python source:
import math
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

np.random.seed(42)

FT_BG = '#FFF1E5'
FT_CLARET = '#990F3D'
FT_OXFORD = '#0F5499'
FT_TEAL = '#0D7680'
FT_MANDARIN = '#FF8833'

plt.rcParams.update({
    'figure.facecolor': FT_BG,
    'axes.facecolor': FT_BG,
    'savefig.facecolor': FT_BG,
    'font.family': 'sans-serif',
    'font.sans-serif': ['Helvetica Neue', 'Arial', 'sans-serif'],
    'axes.spines.top': False,
    'axes.spines.right': False,
})

true_rates = np.array([0.10, 0.15])
best_rate = true_rates.max()
gaps = best_rate - true_rates
K = len(true_rates)
T = 5000
REPS = 200

def run_uniform(reps, T):
    regret = np.zeros((reps, T))
    for rep in range(reps):
        for t in range(T):
            arm = t % K
            regret[rep, t] = gaps[arm]
    return np.cumsum(regret, axis=1).mean(axis=0)

def run_epsilon_greedy(reps, T, eps=0.1):
    regret = np.zeros((reps, T))
    for rep in range(reps):
        counts = np.zeros(K)
        totals = np.zeros(K)
        for t in range(T):
            if np.random.random() < eps or counts.min() == 0:
                arm = np.random.randint(K)
            else:
                arm = np.argmax(totals / np.maximum(counts, 1))
            reward = 1.0 if np.random.random() < true_rates[arm] else 0.0
            counts[arm] += 1
            totals[arm] += reward
            regret[rep, t] = gaps[arm]
    return np.cumsum(regret, axis=1).mean(axis=0)

def run_ucb1(reps, T):
    regret = np.zeros((reps, T))
    for rep in range(reps):
        counts = np.zeros(K)
        totals = np.zeros(K)
        for t in range(T):
            if t < K:
                arm = t
            else:
                ucbs = totals / counts + np.sqrt(2 * math.log(t) / counts)
                arm = np.argmax(ucbs)
            reward = 1.0 if np.random.random() < true_rates[arm] else 0.0
            counts[arm] += 1
            totals[arm] += reward
            regret[rep, t] = gaps[arm]
    return np.cumsum(regret, axis=1).mean(axis=0)

def run_thompson(reps, T):
    regret = np.zeros((reps, T))
    for rep in range(reps):
        alphas = np.ones(K)
        betas = np.ones(K)
        for t in range(T):
            samples = np.array([np.random.beta(alphas[k], betas[k]) for k in range(K)])
            arm = np.argmax(samples)
            reward = 1.0 if np.random.random() < true_rates[arm] else 0.0
            alphas[arm] += reward
            betas[arm] += (1 - reward)
            regret[rep, t] = gaps[arm]
    return np.cumsum(regret, axis=1).mean(axis=0)

uniform_regret = run_uniform(REPS, T)
eps_regret = run_epsilon_greedy(REPS, T)
ucb_regret = run_ucb1(REPS, T)
thompson_regret = run_thompson(REPS, T)

rounds = np.arange(1, T + 1)

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(rounds, uniform_regret, color=FT_CLARET, linewidth=2, label='Uniform (A/B test)', alpha=0.85)
ax.plot(rounds, eps_regret, color=FT_MANDARIN, linewidth=2, label='\u03b5-greedy (\u03b5=0.1)', alpha=0.85)
ax.plot(rounds, ucb_regret, color=FT_OXFORD, linewidth=2, label='UCB1', alpha=0.85)
ax.plot(rounds, thompson_regret, color=FT_TEAL, linewidth=2, label='Thompson sampling', alpha=0.85)

ax.set_xlim(0, T)
ax.set_ylim(0, None)
ax.set_xlabel('Round', fontsize=11, color='#333333')
ax.set_ylabel('Cumulative regret', fontsize=11, color='#333333')
ax.legend(fontsize=9, framealpha=0.9)

fig.text(0.5, 0.97, 'Cumulative Regret: Four Bandit Strategies',
         ha='center', fontsize=14, fontweight='bold', color='#333333')
fig.text(0.5, 0.935, 'Two arms (p=0.10, p=0.15), 5,000 rounds, averaged over 200 replications',
         ha='center', fontsize=10, color='#666666')
fig.text(0.02, 0.01, 'Source: Philip Jama via pjama.github.io',
         fontsize=8, color='#999999', ha='left')
fig.tight_layout(rect=[0, 0.03, 1, 0.92])
fig.savefig('regret_comparison.png', dpi=150, bbox_inches='tight')

print('wrote regret_comparison.png')

Regret Minimization vs. Hypothesis Testing

A/B tests and bandits optimize for different objectives. A/B tests aim for statistical inference: estimating the treatment effect with controlled error rates (Type I and Type II). Bandits aim for cumulative reward: minimizing regret over the entire allocation period.

This distinction matters. A bandit that shifts traffic aggressively toward one arm may reach lower regret but produce biased estimates of each arm's true rate, because the losing arm receives fewer observations. If the goal is to measure the effect precisely (for a journal paper, a regulatory filing, or a reusable model), a fixed-allocation A/B test with proper power analysis is the right tool. If the goal is to maximize total conversions during the experiment itself (ad serving, homepage optimization), bandits are the better fit.

Other Bandit Algorithms

Thompson sampling is not the only option:

  • Epsilon-greedy: exploit the best arm with probability 1 - epsilon, explore uniformly with probability epsilon. Simple but wasteful: it explores arms already known to be bad at the same rate as promising unknowns.
  • UCB1 (Upper Confidence Bound): play the arm with the highest upper confidence bound on its estimated reward. Deterministic and well-analyzed, but less flexible with complex reward models.
  • Bayesian UCB: like UCB1 but uses the posterior's quantile as the upper bound. Combines the structure of UCB with Bayesian updating.
  • Contextual bandits: the reward depends on user features (context). LinUCB and neural contextual bandits generalize the multi-armed setting to personalized allocation.
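For instance, a minimal Bayesian UCB selection rule might look like the sketch below. It uses a fixed 95th-percentile bound for simplicity; published variants typically grow the quantile with the round number:

```python
import numpy as np
from scipy.stats import beta

def bayesian_ucb_arm(alphas, betas, quantile=0.95):
    """Play the arm whose posterior upper quantile is largest.
    Wide posteriors get a high bound (exploration); narrow posteriors
    are judged close to their mean (exploitation)."""
    ucbs = [beta.ppf(quantile, a, b) for a, b in zip(alphas, betas)]
    return int(np.argmax(ucbs))

# A wide Beta(2, 2) posterior (mean 0.5) outranks a narrow
# Beta(10, 90) posterior (mean 0.1) on its upper quantile
chosen = bayesian_ucb_arm([2, 10], [2, 90])
```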

When to Use Bandits vs. A/B Tests

Bandits are well suited when:

  • The cost of assigning users to a losing variant is high (revenue, user experience).
  • The number of variants is large and most will be pruned quickly.
  • The primary goal is cumulative performance, not precise effect estimation.
  • The environment is non-stationary (reward rates drift over time).

A/B tests are preferable when:

  • Unbiased effect estimates are required for downstream decisions.
  • Regulatory or scientific standards demand controlled error rates.
  • The experiment is short and the cost of equal allocation is small.
  • Interaction effects, network effects, or interference make adaptive allocation risky.

Many production systems use a hybrid: run a short A/B test to validate the effect, then switch to bandit allocation for ongoing optimization.

Implementation Considerations

Deploying Thompson sampling in production introduces practical concerns:

  • Batching: in real systems, outcomes arrive in batches (hourly, daily), not one at a time. Batch Thompson sampling updates posteriors once per batch and samples a single allocation vector, trading off exploration granularity for engineering simplicity.
  • Delayed rewards: if the conversion event occurs days after assignment, the posterior lags behind reality. Use conservative priors and longer update windows.
  • Multiple metrics: when optimizing for one metric while guarding another, constrained Thompson sampling samples from the posterior of the primary metric and rejects allocations that violate guardrail thresholds.
  • Non-stationarity: if arm rewards shift over time, use a discounted or sliding-window posterior to downweight old observations.
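One simple discounted-posterior variant for the non-stationary case is sketched below; the discount factor gamma is an illustrative choice, not a recommendation:

```python
def discounted_update(alpha, beta, reward, gamma=0.99):
    """Multiply the old pseudo-counts by gamma before adding the new
    observation, so evidence decays and the posterior can track drift.
    The effective sample size is capped near 1 / (1 - gamma)."""
    return gamma * alpha + reward, gamma * beta + (1 - reward)

# After a long run of successes, the posterior forgets early failures
a, b = 1.0, 1.0
for _ in range(1000):
    a, b = discounted_update(a, b, reward=1)
```

With gamma = 0.99 the total pseudo-counts converge toward roughly 100, so the posterior stays responsive instead of freezing on stale evidence.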

Looking Ahead

Bandits assume you can randomize: you control which arm each user sees. The next article in this series tackles a harder problem. When randomization is impossible, whether because of ethical constraints, organizational inertia, or a treatment that has already happened, causal inference from observational data provides tools to estimate effects from non-experimental evidence.

Thompson sampling allocates traffic adaptively, but it does not answer a different question: what happens when you cannot randomize at all? Observational causal inference picks up where experimentation leaves off.

