SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding
Abstract
Despite significant advancements in Vision-Language Models (VLMs), the performance of existing VLMs remains hindered by object hallucination, a critical challenge to achieving accurate visual understanding. To address this issue, we propose SECOND: Selective and Contrastive Decoding, a novel approach that enables VLMs to effectively leverage multi-scale visual information in an object-centric manner, closely aligning with human visual perception. SECOND progressively selects and integrates multi-scale visual information, facilitating a more precise interpretation of images. By iteratively contrasting this visual information, SECOND significantly reduces perceptual hallucinations and outperforms existing methods across a wide range of benchmarks. Our theoretical analysis and experiments highlight the largely unexplored potential of multi-scale application in VLMs, showing that prioritizing and contrasting across scales outperforms existing methods.
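The abstract describes contrasting predictions across visual scales at decoding time. A minimal illustrative sketch of scale-contrastive decoding is below; the function name, the `alpha` weight, and the specific contrast formula are assumptions for illustration, not the paper's exact method:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a logit vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def scale_contrastive_decode(logits_fine, logits_coarse, alpha=0.5):
    """Pick the next token by contrasting fine-scale logits against
    coarse-scale logits: tokens favored only at the coarser,
    hallucination-prone scale are down-weighted.

    `alpha` (illustrative) controls the strength of the contrast.
    """
    log_p_fine = np.log(softmax(np.asarray(logits_fine, dtype=float)))
    log_p_coarse = np.log(softmax(np.asarray(logits_coarse, dtype=float)))
    scores = (1 + alpha) * log_p_fine - alpha * log_p_coarse
    return int(np.argmax(scores))
```

For example, if the fine-scale view strongly supports one token while the coarse-scale view is indifferent, the contrast preserves the fine-scale choice; if the coarse scale alone pushes a token, the subtraction suppresses it.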
Cite
Text
Park et al. "SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding." Proceedings of the 42nd International Conference on Machine Learning, 2025.
Markdown
[Park et al. "SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/park2025icml-second/)
BibTeX
@inproceedings{park2025icml-second,
title = {{SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding}},
author = {Park, Woohyeon and Kim, Woojin and Kim, Jaeik and Do, Jaeyoung},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {48027--48040},
volume = {267},
url = {https://mlanthology.org/icml/2025/park2025icml-second/}
}