Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs
Abstract
Multi-modal Large Language Models (MLLMs) excel at single-image tasks but struggle with multi-image understanding because of cross-modal misalignment, which leads to hallucinations (context omission, conflation, and misinterpretation). Existing methods based on Direct Preference Optimization (DPO) constrain optimization to a single image reference within the input sequence and neglect holistic context modeling. To address this, we propose Context-to-Cue Direct Preference Optimization (CcDPO), a multi-level preference optimization framework that enhances per-image perception in multi-image settings by zooming into visual clues, from sequential context down to local details. Our approach features two sequentially dependent components: (i) Context-Level Optimization: by introducing low-cost sequence preference pairs, we optimize the model to distinguish between complete and disrupted multi-image contexts, thereby correcting cognitive biases in MLLMs’ multi-image understanding. (ii) Needle-Level Optimization: by integrating region-specific visual prompts with multimodal preference supervision, we direct the model’s attention to critical visual details, effectively suppressing perceptual biases toward fine-grained visual information. To support scalable optimization, we also construct MultiScope-42k, an automatically generated multi-image dataset with hierarchical preference pairs. Experiments show that CcDPO significantly reduces hallucinations and yields consistent performance gains across general single- and multi-image tasks. Code is available at https://github.com/LXDxmu/CcDPO.
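To make the context-level idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a preference pair is built from per-image captions, with the chosen response describing every image in order and the rejected response either dropping one image (omission) or attaching two captions to the wrong images (conflation). The helper names `format_context` and `make_context_pair`, the `Image k:` response template, and the two disruption modes are hypothetical illustrations. The loss is the standard DPO objective on sequence log-probabilities, with `beta=0.1` chosen arbitrarily.

```python
import math
import random


def format_context(indexed_captions):
    """Join (image_index, caption) pairs into one sequential description."""
    return " ".join(f"Image {i}: {c}" for i, c in indexed_captions)


def make_context_pair(captions, mode, rng):
    """Hypothetical builder for a (chosen, rejected) context-level pair.

    chosen   : complete description covering every image in order.
    rejected : disrupted description (one image omitted, or two captions swapped).
    """
    if len(captions) < 2:
        raise ValueError("need at least two images to disrupt the context")
    indexed = list(enumerate(captions, start=1))
    if mode == "omission":
        # Drop one image's description but keep the other indices unchanged.
        drop = rng.randrange(len(indexed))
        rejected = indexed[:drop] + indexed[drop + 1:]
    elif mode == "conflation":
        # Swap two captions so each is attached to the wrong image index.
        i, j = rng.sample(range(len(captions)), 2)
        texts = list(captions)
        texts[i], texts[j] = texts[j], texts[i]
        rejected = list(enumerate(texts, start=1))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return format_context(indexed), format_context(rejected)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given summed sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    rng = random.Random(0)
    chosen, rejected = make_context_pair(["a cat", "a dog", "a car"], "omission", rng)
    print("chosen:  ", chosen)
    print("rejected:", rejected)
    # Made-up log-probabilities, used only to exercise the loss function.
    print("loss:", dpo_loss(-12.0, -15.0, -13.0, -13.5))
```

The loss equals log 2 when the policy and reference models agree on both responses, and it decreases as the policy assigns relatively more probability to the complete description than the reference does. The swap in conflation mode only yields a distinct rejected response when the two captions differ.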
Cite
Text
Li et al. "Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs." Advances in Neural Information Processing Systems, 2025.
Markdown
[Li et al. "Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/li2025neurips-zooming/)
BibTeX
@inproceedings{li2025neurips-zooming,
title = {{Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs}},
author = {Li, Xudong and Zhang, Mengdan and Chen, Peixian and Zheng, Xiawu and Zhang, Yan and Zheng, Jingyuan and Shen, Yunhang and Li, Ke and Fu, Chaoyou and Sun, Xing and Ji, Rongrong},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/li2025neurips-zooming/}
}