In-Context Multi-Armed Bandits via Supervised Pretraining

Abstract

Exploring the in-context learning capabilities of large transformer models, this research focuses on decision-making within reinforcement learning (RL) environments, specifically multi-armed bandit problems. We introduce the Reward-Weighted Decision-Pretrained Transformer (DPT-RW), a model that uses straightforward supervised pretraining with a reward-weighted imitation learning loss. The DPT-RW predicts optimal actions by evaluating a query state together with an in-context dataset drawn from varied tasks. Surprisingly, this simple approach produces a model capable of solving a wide range of RL problems in-context, demonstrating online exploration and offline conservatism without being explicitly trained for either behavior. A standout observation is that the model achieves optimal performance in the online setting, despite being trained on data generated by suboptimal policies and never having access to optimal data.
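
To make the pretraining objective concrete, below is a minimal sketch of a reward-weighted imitation loss of the kind the abstract describes: a standard cross-entropy imitation loss over in-context actions, with each example weighted by the reward that action earned. Everything here is an illustrative assumption, not the authors' released code; the toy `BanditPolicyHead`, the synthetic data pipeline, and the exact weighting scheme are stand-ins for the paper's transformer and pretraining distribution.

```python
# Sketch of reward-weighted supervised pretraining for in-context bandits.
# All module and variable names are hypothetical; the real DPT-RW uses a
# transformer over the full in-context dataset rather than summary statistics.
import torch
import torch.nn as nn

class BanditPolicyHead(nn.Module):
    """Toy stand-in for the transformer: maps a summary of the in-context
    dataset (here, noisy per-arm reward estimates) to action logits."""
    def __init__(self, num_arms: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_arms, hidden), nn.ReLU(), nn.Linear(hidden, num_arms)
        )

    def forward(self, context_stats: torch.Tensor) -> torch.Tensor:
        return self.net(context_stats)  # (batch, num_arms) action logits

def reward_weighted_loss(logits, actions, rewards):
    """Imitation (cross-entropy) loss on in-context actions, weighted by the
    reward each action earned, so high-reward actions dominate the gradient."""
    per_example = nn.functional.cross_entropy(logits, actions, reduction="none")
    return (rewards * per_example).mean()

# One pretraining step on synthetic bandit tasks (assumed data pipeline).
num_arms, batch = 5, 32
model = BanditPolicyHead(num_arms)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

arm_means = torch.rand(batch, num_arms)                        # latent task parameters
context_stats = arm_means + 0.1 * torch.randn_like(arm_means)  # noisy in-context summary
actions = torch.randint(num_arms, (batch,))                    # actions from a suboptimal policy
rewards = arm_means.gather(1, actions[:, None]).squeeze(1)     # rewards those actions earned

loss = reward_weighted_loss(model(context_stats), actions, rewards)
opt.zero_grad()
loss.backward()
opt.step()
```

Under this kind of objective, the model never needs optimal-action labels: weighting the imitation loss by observed reward biases the learned policy toward the high-reward actions present in the suboptimal pretraining data, which is consistent with the abstract's claim of strong online performance without access to optimal data.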

Cite

Text

Zhang et al. "In-Context Multi-Armed Bandits via Supervised Pretraining." NeurIPS 2023 Workshops: FMDM, 2023.

Markdown

[Zhang et al. "In-Context Multi-Armed Bandits via Supervised Pretraining." NeurIPS 2023 Workshops: FMDM, 2023.](https://mlanthology.org/neuripsw/2023/zhang2023neuripsw-incontext/)

BibTeX

@inproceedings{zhang2023neuripsw-incontext,
  title     = {{In-Context Multi-Armed Bandits via Supervised Pretraining}},
  author    = {Zhang, Fred Weiying and Ye, Jiaxin and Yang, Zhuoran},
  booktitle = {NeurIPS 2023 Workshops: FMDM},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/zhang2023neuripsw-incontext/}
}