A Practical Algorithm for Feature-Rich, Non-Stationary Bandit Problems
Abstract
Contextual bandits are incredibly useful in many practical problems. We go one step further by devising a more realistic problem that combines: (1) contextual bandits with dense arm features, (2) non-linear reward functions, and (3) a generalization of correlated bandits where reward distributions change over time but the degree of correlation maintains. This formulation lends itself to a wider set of applications such as recommendation tasks. To solve this problem, we introduce *conditionally coupled contextual* ($C_3$) Thompson sampling for Bernoulli bandits. It combines an improved Nadaraya-Watson estimator on an embedding space with Thompson sampling that allows online learning without retraining. Empirical results show that $C_3$ outperforms the next best algorithm by 5.7% lower average cumulative regret on four OpenML tabular datasets as well as demonstrating a 12.4% click lift on Microsoft News Dataset (MIND) compared to other algorithms.
Cite
Text
Loh et al. "A Practical Algorithm for Feature-Rich, Non-Stationary Bandit Problems." Transactions on Machine Learning Research, 2026.Markdown
[Loh et al. "A Practical Algorithm for Feature-Rich, Non-Stationary Bandit Problems." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/loh2026tmlr-practical/)BibTeX
@article{loh2026tmlr-practical,
title = {{A Practical Algorithm for Feature-Rich, Non-Stationary Bandit Problems}},
author = {Loh, William and Sinha, Sajib Kumer and Agarwal, Ankur and Poupart, Pascal},
journal = {Transactions on Machine Learning Research},
year = {2026},
url = {https://mlanthology.org/tmlr/2026/loh2026tmlr-practical/}
}