
Thompson Sampling for Dynamic Multi-armed Bandits

Ole-Christopher Granmo, Neha Gupta, Ashok K. Agrawala

subject

Computer Science::Machine Learning, Mathematical optimization, Computer science, Order statistic, Bayesian probability, Sampling (statistics), Regret, Artificial intelligence, Thompson sampling, Random variable, Selection (genetic algorithm)

description

The importance of multi-armed bandit (MAB) problems is growing due to their recent application in a wide variety of areas, such as online advertising, news article selection, wireless networks, and clinical trials, to name a few. The most common assumption made when solving such MAB problems is that the unknown reward probability $\theta_k$ of each bandit arm $k$ is fixed. However, this assumption rarely holds in practice, simply because real-life problems often involve underlying processes that are dynamically evolving. In this paper, we model problems where the reward probabilities $\theta_k$ are drifting, and introduce a new method called Dynamic Thompson Sampling (DTS) that facilitates Order Statistics based Thompson Sampling for these dynamically evolving MABs. The DTS algorithm adapts its success probability estimates, $\hat{\theta}_k$, faster than traditional Thompson Sampling schemes, and thus achieves lower regret. Extensive experiments demonstrate that DTS outperforms current state-of-the-art approaches, namely pure Thompson Sampling, UCB-Normal, and UCB_f, for the case of dynamic reward probabilities. Furthermore, this performance advantage increases persistently with the number of bandit arms.
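
To make the abstract's idea concrete, below is a minimal Python sketch of Beta-Bernoulli Thompson Sampling with a capped, discounted posterior update, so the estimates $\hat{\theta}_k$ can track drifting reward probabilities. This is an illustration under stated assumptions, not the paper's reference implementation: the cap parameter C, the function name dts_bandit, and the arm interface are all hypothetical.

    import math
    import random

    def dts_bandit(arms, pulls, C=100.0):
        """Thompson Sampling with a capped Beta-Bernoulli update.

        arms  : list of callables, each returning a 0/1 reward when pulled
        pulls : number of rounds to play
        C     : assumed cap on the effective observation count per arm;
                once alpha + beta reaches C, older evidence decays
                exponentially, letting the estimate follow a drifting theta_k
        """
        k = len(arms)
        alpha = [1.0] * k  # Beta prior: discounted success counts
        beta = [1.0] * k   # Beta prior: discounted failure counts
        total_reward = 0
        for _ in range(pulls):
            # Order-statistics step: draw one sample per arm from its
            # Beta posterior and play the arm with the largest sample.
            samples = [random.betavariate(alpha[i], beta[i]) for i in range(k)]
            i = max(range(k), key=lambda j: samples[j])
            r = arms[i]()  # observe a Bernoulli reward in {0, 1}
            total_reward += r
            if alpha[i] + beta[i] < C:
                # Standard conjugate Bayesian update while evidence is scarce.
                alpha[i] += r
                beta[i] += 1 - r
            else:
                # Discounted update: keeps alpha + beta fixed at C, so each
                # new observation has constant weight and stale evidence fades.
                alpha[i] = (alpha[i] + r) * C / (C + 1.0)
                beta[i] = (beta[i] + (1 - r)) * C / (C + 1.0)
        return total_reward

    if __name__ == "__main__":
        # Illustrative dynamic environment: two arms whose success
        # probabilities drift sinusoidally, so the best arm changes over time.
        t = [0]
        def make_arm(phase):
            def pull():
                t[0] += 1
                p = 0.5 + 0.4 * math.sin(t[0] / 500.0 + phase)
                return 1 if random.random() < p else 0
            return pull
        total = dts_bandit([make_arm(0.0), make_arm(math.pi)], pulls=5000, C=100.0)
        print("total reward:", total)

Intuitively, a smaller C makes the estimator adapt faster to drift at the cost of noisier estimates, while letting C grow without bound recovers standard Thompson Sampling for stationary arms.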

https://doi.org/10.1109/icmla.2011.144