Mechanism Design for LLM Fine-Tuning with Multiple Reward Models

Haoran Sun, Yurong Chen, Siwei Wang, Wei Chen, Xiaotie Deng

NeurIPSW 2024

Abstract

Fine-tuning large language models (LLMs) to aggregate multiple preferences has attracted considerable research attention. As aggregation algorithms advance, a potential economic scenario arises in which fine-tuning services are provided to agents with different preferences. In this context, agents may benefit from strategically misreporting their preferences, which could affect the fine-tuned outcomes. This paper addresses such incentive issues by framing them as a mechanism design problem: an LLM provider determines the fine-tuning objective (training rule) and the pricing scheme (payment rule) for agents. We primarily focus on a representative class of training rules that maximize social welfare subject to certain regularizations, referred to as \tr\ rules. First, we show that under most circumstances, truthful reporting is sub-optimal when only a training rule is used, which highlights the necessity of payments. Second, we design affine maximizer payment rules that implement \tr\ rules in dominant-strategy incentive compatibility (DSIC). Further, we characterize sufficient conditions for payment equivalence. For a training rule that satisfies these conditions, we identify all the payment rules that implement it in DSIC, since they differ from each other only by a constant term that is independent of agents' reports.

Cite

Text

Sun et al. "Mechanism Design for LLM Fine-Tuning with Multiple Reward Models." NeurIPS 2024 Workshops: Pluralistic-Alignment, 2024.

Markdown

[Sun et al. "Mechanism Design for LLM Fine-Tuning with Multiple Reward Models." NeurIPS 2024 Workshops: Pluralistic-Alignment, 2024.](https://mlanthology.org/neuripsw/2024/sun2024neuripsw-mechanism/)

BibTeX

@inproceedings{sun2024neuripsw-mechanism,
  title     = {{Mechanism Design for LLM Fine-Tuning with Multiple Reward Models}},
  author    = {Sun, Haoran and Chen, Yurong and Wang, Siwei and Chen, Wei and Deng, Xiaotie},
  booktitle = {NeurIPS 2024 Workshops: Pluralistic-Alignment},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/sun2024neuripsw-mechanism/}
}