Prompt Tuning Decision Transformers with Structured and Scalable Bandits

Abstract

Prompt tuning has emerged as a key technique for adapting large pre-trained Decision Transformers (DTs) in offline Reinforcement Learning (RL), particularly in multi-task and few-shot settings. The Prompting Decision Transformer (PDT) enables task generalization via trajectory prompts sampled uniformly from expert demonstrations, without accounting for prompt informativeness. In this work, we propose a bandit-based prompt-tuning method that learns to construct optimal trajectory prompts from demonstration data at inference time. We devise a structured bandit architecture operating in the trajectory-prompt space, achieving linear rather than combinatorial scaling with prompt size. Additionally, we show that the pre-trained PDT itself can serve as a powerful feature extractor for the bandit, enabling efficient reward modeling across various environments. We theoretically establish regret bounds and demonstrate empirically that our method consistently enhances performance across a wide range of tasks, high-dimensional environments, and out-of-distribution scenarios, outperforming existing prompt-tuning baselines.
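
The core mechanism sketched in the abstract, a structured bandit that scores trajectory-prompt segments with features from the frozen pre-trained PDT so that selection cost grows linearly with prompt size, can be illustrated with a minimal LinUCB-style sketch. Everything below is a hypothetical illustration under stated assumptions, not the paper's implementation: `encode_segment` stands in for the frozen PDT feature extractor, the pool sizes and dimensions are made up, and the episode return is simulated rather than obtained by rolling out the DT.

```python
# Minimal LinUCB-style structured bandit over trajectory-prompt segments.
# Assumptions (hypothetical, not from the paper): prompts are built from K
# independent segment slots, the reward is modeled as linear in the sum of
# segment features, and a frozen pre-trained PDT provides those features.
import numpy as np

rng = np.random.default_rng(0)
D = 16       # feature dimension of the (assumed) PDT segment embedding
K = 3        # number of segments per trajectory prompt
POOL = 50    # candidate demonstration segments per slot
ALPHA = 1.0  # UCB exploration coefficient

def encode_segment(seg_id: int) -> np.ndarray:
    """Placeholder for the frozen PDT encoder mapping a segment to features."""
    return np.random.default_rng(seg_id).standard_normal(D)

# Shared linear reward model, estimated by ridge regression: theta = A^{-1} b.
A = np.eye(D)
b = np.zeros(D)

def select_prompt() -> list[int]:
    """Pick one segment per slot by UCB score; cost is O(K * POOL), not POOL**K."""
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b
    prompt = []
    for slot in range(K):
        best, best_ucb = 0, -np.inf
        for seg in range(POOL):
            x = encode_segment(slot * POOL + seg)
            # Per-slot UCB is a structured approximation to the joint arm's UCB.
            ucb = theta @ x + ALPHA * np.sqrt(x @ A_inv @ x)
            if ucb > best_ucb:
                best, best_ucb = seg, ucb
        prompt.append(best)
    return prompt

def update(prompt: list[int], episode_return: float) -> None:
    """Standard LinUCB update on the prompt's combined (summed) feature."""
    global A, b
    z = sum(encode_segment(slot * POOL + seg) for slot, seg in enumerate(prompt))
    A += np.outer(z, z)
    b += episode_return * z

# Toy interaction loop; the return is a stand-in for rolling out the DT.
for t in range(10):
    p = select_prompt()
    ret = float(rng.standard_normal())
    update(p, ret)
```

Because the prompt's predicted reward decomposes additively over its K segment features, the greedy per-slot maximization touches only K * POOL candidates instead of POOL**K joint arms, which is the linear-versus-combinatorial scaling the abstract refers to. For LinUCB-style algorithms on such linear reward models, standard analyses give regret on the order of d * sqrt(T) up to log factors; the bounds established in the paper itself may differ.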

Cite

Text

Rietz et al. "Prompt Tuning Decision Transformers with Structured and Scalable Bandits." Advances in Neural Information Processing Systems, 2025.

Markdown

[Rietz et al. "Prompt Tuning Decision Transformers with Structured and Scalable Bandits." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/rietz2025neurips-prompt/)

BibTeX

@inproceedings{rietz2025neurips-prompt,
  title     = {{Prompt Tuning Decision Transformers with Structured and Scalable Bandits}},
  author    = {Rietz, Finn and Smirnov, Oleg and Karimi, Sara and Cao, Lele},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/rietz2025neurips-prompt/}
}