## What is a Multi-Armed Bandit?

A multi-armed bandit (MAB) can refer either to the multi-armed bandit problem or to an algorithm that solves this problem with a certain efficiency. The name comes from an illustration of the problem in which a gambler is presented with two or more slot machines and can pull the arm (lever) on each machine and observe the reward it gives. If one of the machines is rigged to produce rewards with a higher probability, the task is to establish which one it is while spending as little cash as possible. In an A/B testing scenario this would be equivalent to establishing whether the treatment arm or the control arm of an A/B test results in a higher probability of conversion, purchase, etc.

The crucial trade-off in the multi-armed bandit problem is the balance between **exploration and exploitation**: if a test variant is performing worse than the control, for how long should one continue serving it to test users? If the reverse is true, for how long should the control be retained and what proportion of users should be sent to it? In this sense a multi-armed bandit is an adaptive sequential design, and it thus shares the **sub-optimal performance** of such designs versus ordinary sequential testing designs.
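The exploration/exploitation trade-off can be made concrete with a minimal epsilon-greedy sketch. Everything here is a hypothetical illustration, not from the source: the function name, the two assumed "true" conversion rates (a control at 5% and a treatment at 8%), and the parameter values are all chosen for demonstration.

```python
import random

# Minimal epsilon-greedy bandit sketch (hypothetical illustration).
# With probability epsilon the algorithm explores (picks a random arm);
# otherwise it exploits (picks the arm with the best observed mean reward).
def epsilon_greedy(true_rates, n_pulls=10000, epsilon=0.1, seed=42):
    rng = random.Random(seed)
    k = len(true_rates)
    counts = [0] * k      # number of pulls per arm
    rewards = [0.0] * k   # cumulative reward per arm
    for _ in range(n_pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(k)  # explore: random arm
        else:
            # exploit: arm with the highest observed mean (unpulled arms count as 0)
            means = [rewards[i] / counts[i] if counts[i] else 0.0 for i in range(k)]
            arm = max(range(k), key=lambda i: means[i])
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        rewards[arm] += reward
    return counts, rewards

# Hypothetical control (5%) vs. treatment (8%) conversion rates.
counts, rewards = epsilon_greedy([0.05, 0.08])
print(counts)  # with enough pulls, the higher-rate arm typically accumulates most of the traffic
```

A higher epsilon spends more traffic on exploration (faster learning, higher cost on inferior arms); a lower epsilon exploits sooner but risks locking onto an arm that merely got lucky early.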

A multi-armed bandit algorithm shares some **assumptions** with a classic parametric test, such as the requirement that the data from the two arms be independent and identically distributed, with constant mean and constant variance. Importantly, however, it adds the assumption that there is no delay between allocating a user to a test variant and observing the outcome for that user. This last assumption is especially problematic in an online controlled experiment of the most typical kind, in which the time between a user being allocated to an arm and the user performing an action can be hours, days, or even weeks. While there has been work on the problem of delayed feedback, it remains an active area of research. The assumption is less problematic in situations where the parameter of interest is a click-through rate and the click is expected to happen very quickly after the impression.
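The delayed-feedback issue can be sketched by queuing each pull's reward so it only becomes observable some number of steps later. This is a hypothetical simulation (the function name, `delay` parameter, and rates are all assumed for illustration): until feedback starts arriving, the algorithm allocates users based on no information at all, and thereafter on stale estimates.

```python
import random
from collections import deque

# Hypothetical sketch of delayed feedback: each pull's reward is revealed only
# `delay` steps after allocation, so arm estimates lag behind allocation decisions.
def greedy_with_delay(true_rates, n_pulls=5000, delay=200, epsilon=0.1, seed=7):
    rng = random.Random(seed)
    k = len(true_rates)
    counts = [0] * k       # pulls whose feedback has been observed
    rewards = [0.0] * k    # observed cumulative reward per arm
    pulls = [0] * k        # pulls actually allocated, observed or not
    pending = deque()      # (step when observable, arm, reward)
    for t in range(n_pulls):
        # resolve any feedback that has become observable by step t
        while pending and pending[0][0] <= t:
            _, arm, r = pending.popleft()
            counts[arm] += 1
            rewards[arm] += r
        if rng.random() < epsilon:
            arm = rng.randrange(k)  # explore
        else:
            means = [rewards[i] / counts[i] if counts[i] else 0.0 for i in range(k)]
            arm = max(range(k), key=lambda i: means[i])  # exploit stale estimates
        pulls[arm] += 1
        r = 1.0 if rng.random() < true_rates[arm] else 0.0
        pending.append((t + delay, arm, r))  # outcome arrives `delay` steps later
    return pulls

# With delay=0 the bandit adapts immediately; with a long delay it spends many
# pulls before any feedback arrives, behaving like a fixed allocation at first.
print(greedy_with_delay([0.05, 0.08], delay=200))
```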

The constant mean and variance assumptions affect a bandit algorithm significantly more than they affect an A/B test of reasonable duration. A further issue is the interpretation of the results of bandit algorithms, especially with regard to generalizability (external validity).
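Why non-constant means hurt a bandit more can be illustrated with a sketch in which the arms' true conversion rates swap halfway through the test. This is a hypothetical simulation (the function name, the assumed rates, and the use of Thompson sampling as the example bandit algorithm are all illustrative assumptions): the bandit concentrates traffic based on the first half, while a fixed 50/50 A/B test would sample both arms equally throughout the shift.

```python
import random

# Hypothetical illustration: arms' true rates swap mid-test, violating the
# constant-mean assumption. A Thompson sampling bandit will have concentrated
# traffic on the formerly better arm by the time the shift occurs.
def thompson_under_shift(n_pulls=4000, seed=3):
    rng = random.Random(seed)
    successes = [0, 0]
    failures = [0, 0]
    counts = [0, 0]  # pulls allocated to each arm
    for t in range(n_pulls):
        # true conversion rates swap halfway through (assumed for illustration)
        rates = [0.10, 0.05] if t < n_pulls // 2 else [0.05, 0.10]
        # Thompson sampling: draw a rate from each arm's Beta posterior, pick the max
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1) for i in range(2)]
        arm = max(range(2), key=lambda i: samples[i])
        counts[arm] += 1
        if rng.random() < rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return counts

# A fixed-allocation A/B test averages over the shift; the bandit's allocation
# largely reflects whichever arm led during the first half.
print(thompson_under_shift())
```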

A further limitation arises when one has more than one primary KPI. A composite KPI is one possible solution, but only when constructing such a composite is possible and one does not care which particular KPI leads to a change in it.

Like this glossary entry? For an in-depth and comprehensive reading on A/B testing stats, check out the book "Statistical Methods in Online A/B Testing" by the author of this glossary, Georgi Georgiev.