SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding

Abstract

Despite significant advancements in Vision-Language Models (VLMs), the performance of existing VLMs remains hindered by object hallucination, a critical challenge to achieving accurate visual understanding. To address this issue, we propose SECOND: Selective and Contrastive Decoding, a novel approach that enables VLMs to effectively leverage multi-scale visual information in an object-centric manner, closely aligning with human visual perception. SECOND progressively selects and integrates multi-scale visual information, facilitating a more precise interpretation of images. By iteratively contrasting this visual information, SECOND significantly reduces perceptual hallucinations and improves performance across a wide range of benchmarks. Our theoretical analysis and experiments highlight the largely unexplored potential of multi-scale applications in VLMs, showing that prioritizing and contrasting across scales outperforms existing methods.
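To make the contrastive step concrete, the following is a minimal, generic sketch of contrastive decoding between two visual scales. It is not the paper's actual SECOND algorithm: the two-scale setup, the `alpha` weighting parameter, and the function names are illustrative assumptions, showing only the standard idea of amplifying tokens that a finer-scale view supports more strongly than a coarser-scale view.

```python
import numpy as np


def contrastive_logits(logits_fine, logits_coarse, alpha=0.5):
    """Generic contrastive decoding (illustrative, not the paper's method):
    boost tokens the fine-scale view favors relative to the coarse-scale view.
    `alpha` is a hypothetical contrast-strength hyperparameter."""
    return (1 + alpha) * logits_fine - alpha * logits_coarse


def softmax(x):
    # Numerically stable softmax over a 1-D logit vector.
    e = np.exp(x - x.max())
    return e / e.sum()


# Toy vocabulary of 3 tokens: the fine-scale view strongly prefers token 2,
# so contrasting against the coarse-scale view sharpens that preference.
fine = np.array([1.0, 2.0, 3.0])
coarse = np.array([1.0, 2.0, 1.0])
probs = softmax(contrastive_logits(fine, coarse, alpha=0.5))
```

In this toy setup the contrasted logits become `[1.0, 2.0, 4.0]`, so the token supported only by the fine-scale view gains probability mass; the actual SECOND method applies its selection and contrast iteratively across multiple scales.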

Cite

Text

Park et al. "SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Park et al. "SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/park2025icml-second/)

BibTeX

@inproceedings{park2025icml-second,
  title     = {{SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding}},
  author    = {Park, Woohyeon and Kim, Woojin and Kim, Jaeik and Do, Jaeyoung},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {48027--48040},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/park2025icml-second/}
}