Prompt Optimization with Logged Bandit Data
Abstract
We study how to use naturally available user feedback, such as clicks, to optimize a prompt policy for generating sentences with large language models (LLMs). Naive approaches, including regression-based and importance sampling-based ones, suffer either from bias induced by the logged data or from high variance caused by the large action space of prompts. To circumvent these challenges, we propose a way to leverage similarity and smoothness in the (generated) sentence embedding space, substantially reducing the variance of the policy gradient estimates while keeping their bias small. Initial experiments on synthetic data demonstrate the effectiveness of our approach. We also plan to publish the extended benchmark and simulator as open-source software.
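The abstract does not spell out the estimator, so below is only a minimal sketch of the general idea it describes: smoothing importance weights across actions whose sentence embeddings are similar, trading a small bias for a large variance reduction in the off-policy gradient. Everything in the sketch, including the Gaussian kernel, the softmax prompt policy, the uniform logging policy, and the toy data, is an illustrative assumption, not the authors' method.

```python
# Sketch: variance reduction for an off-policy (bandit-feedback) policy
# gradient by kernel smoothing in a sentence-embedding space.
# All quantities are synthetic stand-ins; this is NOT the paper's estimator.
import numpy as np

rng = np.random.default_rng(0)
n_logged, n_actions, embed_dim = 512, 100, 8

# Toy logged bandit data: actions (generated sentences), click rewards,
# and the logging policy's propensities (assumed uniform here).
embeddings = rng.normal(size=(n_actions, embed_dim))   # sentence embeddings
logging_probs = np.full(n_actions, 1.0 / n_actions)
actions = rng.integers(0, n_actions, size=n_logged)
rewards = rng.binomial(1, 0.3, size=n_logged).astype(float)  # e.g., clicks

# Target policy pi_theta: softmax over a per-action score vector theta.
theta = rng.normal(size=n_actions)
pi = np.exp(theta - theta.max())
pi /= pi.sum()

def grad_log_pi(a):
    """Score function of a softmax policy: e_a - pi."""
    g = -pi.copy()
    g[a] += 1.0
    return g

# Vanilla importance sampling: unbiased, but the weights explode as the
# action (prompt/sentence) space grows, producing high-variance gradients.
w = pi[actions] / logging_probs[actions]
vanilla_grad = np.mean(
    [w[i] * rewards[i] * grad_log_pi(actions[i]) for i in range(n_logged)],
    axis=0,
)

# Kernel-smoothed variant: share probability mass across actions whose
# embeddings are close, exploiting smoothness of the reward in that space.
bandwidth = 1.0  # assumed hyperparameter
d2 = ((embeddings[:, None, :] - embeddings[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2 * bandwidth**2))   # (n_actions, n_actions) similarity
K /= K.sum(axis=1, keepdims=True)      # row-normalize

# Marginal probability of landing "near" each action under either policy;
# their ratio replaces the raw, high-variance importance weight.
pi_smooth = K @ pi
mu_smooth = K @ logging_probs
w_smooth = pi_smooth[actions] / mu_smooth[actions]
smoothed_grad = np.mean(
    [w_smooth[i] * rewards[i] * grad_log_pi(actions[i]) for i in range(n_logged)],
    axis=0,
)

print("vanilla grad norm: ", np.linalg.norm(vanilla_grad))
print("smoothed grad norm:", np.linalg.norm(smoothed_grad))
```

Because the smoothed weights are ratios of kernel-averaged marginals rather than per-action probabilities, they stay bounded as the action space grows; the cost is a bias that shrinks with the kernel bandwidth.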
Cite
Text
Kiyohara et al. "Prompt Optimization with Logged Bandit Data." ICLR 2024 Workshops: DPFM, 2024.
Markdown
[Kiyohara et al. "Prompt Optimization with Logged Bandit Data." ICLR 2024 Workshops: DPFM, 2024.](https://mlanthology.org/iclrw/2024/kiyohara2024iclrw-prompt/)
BibTeX
@inproceedings{kiyohara2024iclrw-prompt,
  title     = {{Prompt Optimization with Logged Bandit Data}},
  author    = {Kiyohara, Haruka and Saito, Yuta and Cao, Daniel Yiming and Joachims, Thorsten},
  booktitle = {ICLR 2024 Workshops: DPFM},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/kiyohara2024iclrw-prompt/}
}