Structured Policy Optimization: Enhance Large Vision-Language Model via Self-Referenced Dialogue
Abstract
Preference optimization algorithms typically enhance LLM response quality by leveraging human feedback on multiple answers given a fixed instruction. However, these methods often fail to capture the dynamic nature of conversational exchanges. For large vision-language models (LVLMs), direct preference optimization (DPO) can over-emphasize linguistic nuances while overlooking visual context. To address this challenge, we introduce structured policy optimization (SPO) -- a novel preference optimization method that simultaneously aligns preference instructions, responses, and dialogue interactions to improve multi-modal understanding and reasoning capabilities. The efficacy of SPO is attributed to one key design: treating the questioning and answering as a sequential action and binding them through a trajectory reward. This reward formulation better aligns with real-world dialogue studies and eliminates the need for fixed instructions. We evaluate our models on interleaved benchmarks, including image, multi-image, and video-based understanding and reasoning tasks. Experimental results show that fine-tuning an LVLM with the proposed SPO on multi-modal preference data can align it with human preference more efficiently than DPO.
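As a rough illustration of the idea of binding a question and its answer through one trajectory-level reward, the sketch below applies the standard DPO loss to a combined (question, answer) log-probability. This is not the paper's formulation: the abstract gives no equations, so the additive factorization in `trajectory_logp`, the function names, and the `beta` default are all assumptions made for illustration only.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probabilities
    under the policy (logp_*) and under a frozen reference model (ref_logp_*)."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), computed in a numerically stable way
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def trajectory_logp(logp_question, logp_answer):
    """ASSUMPTION (not from the paper): treat question and answer as one sequential
    action whose log-probability is the sum of the two parts."""
    return logp_question + logp_answer


# Hypothetical log-probabilities for a preferred and a dispreferred dialogue turn.
chosen = trajectory_logp(logp_question=-1.0, logp_answer=-2.0)
rejected = trajectory_logp(logp_question=-2.0, logp_answer=-3.0)
loss = dpo_loss(chosen, rejected, ref_logp_chosen=-4.0, ref_logp_rejected=-4.0)
```

Under this assumed factorization, the preference signal reaches both the generated question and the answer, rather than only the response to a fixed instruction.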
Cite
Text
Sun et al. "Structured Policy Optimization: Enhance Large Vision-Language Model via Self-Referenced Dialogue." International Conference on Computer Vision, 2025.
Markdown
[Sun et al. "Structured Policy Optimization: Enhance Large Vision-Language Model via Self-Referenced Dialogue." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/sun2025iccv-structured/)
BibTeX
@inproceedings{sun2025iccv-structured,
title = {{Structured Policy Optimization: Enhance Large Vision-Language Model via Self-Referenced Dialogue}},
author = {Sun, Guohao and Qin, Can and Feng, Yihao and Chen, Zeyuan and Xu, Ran and Dianat, Sohail and Rabbani, Majid and Rao, Raghuveer and Tao, Zhiqiang},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {741-751},
url = {https://mlanthology.org/iccv/2025/sun2025iccv-structured/}
}