Structured Policy Optimization: Enhance Large Vision-Language Model via Self-Referenced Dialogue

Abstract

Preference optimization algorithms typically improve LLM response quality by leveraging human feedback on multiple answers to a fixed instruction. However, these methods often fail to capture the dynamic nature of conversational exchanges. For large vision-language models (LVLMs), direct preference optimization (DPO) can over-emphasize linguistic nuances while overlooking visual context. To address this challenge, we introduce structured policy optimization (SPO), a novel preference optimization method that simultaneously aligns preference instructions, responses, and dialogue interactions to improve multi-modal understanding and reasoning capabilities. The efficacy of SPO is attributed to one key design: treating questioning and answering as a sequential action and binding them through a trajectory reward. This reward formulation better aligns with real-world dialogue studies and eliminates the need for fixed instructions. We evaluate our models on interleaved benchmarks, including image, multi-image, and video-based understanding and reasoning tasks. Experimental results show that fine-tuning an LVLM with the proposed SPO on multi-modal preference data aligns it with human preference more efficiently than DPO.
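
As a rough illustration of the idea of binding a question and its answer into one trajectory, the sketch below computes the standard DPO loss and then applies it to a joint question-plus-answer log-probability. The abstract does not give the SPO objective, so this is not the paper's formulation: the function names, the `beta` value, and the choice to score a trajectory as the sum of question and answer log-probabilities (the chain rule for a two-step sequence) are all assumptions made for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Each argument is a sequence log-probability under the policy or the
    frozen reference model. beta=0.1 is an illustrative default only.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def trajectory_logp(question_logp: float, answer_logp: float) -> float:
    """Hypothetical helper: joint log-probability of a (question, answer)
    trajectory, i.e. log p(q | context) + log p(a | q, context)."""
    return question_logp + answer_logp


# Illustrative numbers only (not from the paper): the policy prefers the
# chosen trajectory slightly more than the reference model does.
chosen = trajectory_logp(question_logp=-4.0, answer_logp=-6.0)
rejected = trajectory_logp(question_logp=-5.0, answer_logp=-9.0)
ref_chosen = trajectory_logp(question_logp=-4.5, answer_logp=-6.5)
ref_rejected = trajectory_logp(question_logp=-5.0, answer_logp=-8.0)

loss = dpo_loss(chosen, rejected, ref_chosen, ref_rejected)
print(f"trajectory-level preference loss: {loss:.4f}")
```

Under this assumption the preference signal covers the question as well as the answer, which is one way to read "eliminates the need for fixed instructions"; the actual reward used by SPO may differ.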

Cite

Text

Sun et al. "Structured Policy Optimization: Enhance Large Vision-Language Model via Self-Referenced Dialogue." International Conference on Computer Vision, 2025.

Markdown

[Sun et al. "Structured Policy Optimization: Enhance Large Vision-Language Model via Self-Referenced Dialogue." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/sun2025iccv-structured/)

BibTeX

@inproceedings{sun2025iccv-structured,
  title     = {{Structured Policy Optimization: Enhance Large Vision-Language Model via Self-Referenced Dialogue}},
  author    = {Sun, Guohao and Qin, Can and Feng, Yihao and Chen, Zeyuan and Xu, Ran and Dianat, Sohail and Rabbani, Majid and Rao, Raghuveer and Tao, Zhiqiang},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {741--751},
  url       = {https://mlanthology.org/iccv/2025/sun2025iccv-structured/}
}