Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models
Abstract
Hallucination remains a significant challenge in Large Vision-Language Models (LVLMs). To alleviate this issue, some methods, known as contrastive decoding, induce hallucinations by manually disturbing the raw vision or instruction inputs and then mitigate them by contrasting the outputs of the original and disturbed LVLMs. However, these holistic input disturbances sometimes introduce unintended noise and also double the inference cost. To tackle these issues, we propose a simple yet effective method named $\textit{Self-Introspective Decoding}$ (SID). Our empirical investigations reveal that pre-trained LVLMs can introspectively assess the importance of vision tokens based on preceding vision and text (both instruction and generated) tokens. Leveraging this insight, we develop the Context and Text-aware Token Selection (CT$^2$S) strategy, which preserves only the least important vision tokens after the early decoder layers, thereby adaptively amplifying vision-and-text association hallucinations during auto-regressive decoding. This strategy ensures that the multimodal knowledge absorbed in the early decoder layers induces multimodal contextual rather than aimless hallucinations, and it significantly reduces the computational burden. Subsequently, the amplified fine-grained hallucinations are subtracted from the original token logits, effectively alleviating hallucinations without compromising the LVLMs' general ability. Extensive experiments illustrate that SID generates less hallucinated, higher-quality text across various metrics, without much additional computational cost.
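The abstract describes two steps: a hallucination-amplified forward pass that keeps only the least important vision tokens after the early decoder layers (CT$^2$S), and a contrastive step that subtracts the amplified logits from the original ones. The sketch below is a minimal illustration of these two ideas, not the paper's implementation: the function names, tensor shapes, the attention-based importance score, the `keep_ratio` and `alpha` hyperparameters, and the (1 + alpha)/alpha weighting (borrowed from the common contrastive-decoding formulation) are all assumptions made for illustration.

```python
import torch


def ct2s_select(vision_tokens: torch.Tensor,
                importance: torch.Tensor,
                keep_ratio: float = 0.1) -> torch.Tensor:
    """Illustrative CT^2S step: keep only the *least* important vision tokens.

    vision_tokens: (num_vision_tokens, hidden_dim) hidden states after an early layer.
    importance:    (num_vision_tokens,) importance scores, e.g. attention mass that
                   preceding vision/text tokens place on each vision token (assumed).
    keep_ratio:    fraction of vision tokens to retain (hypothetical hyperparameter).
    """
    k = max(1, int(keep_ratio * vision_tokens.size(0)))
    # Lowest-importance tokens are the ones that amplify vision-and-text
    # association hallucinations in the amplified decoding branch.
    _, idx = torch.topk(importance, k, largest=False)
    return vision_tokens[idx]


def contrastive_logits(logits_original: torch.Tensor,
                       logits_amplified: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    """Subtract the amplified-hallucination logits from the original logits.

    The (1 + alpha) / alpha weighting follows the generic contrastive-decoding
    formulation; the exact form used by SID may differ.
    """
    return (1.0 + alpha) * logits_original - alpha * logits_amplified


if __name__ == "__main__":
    vocab_size = 32000
    logits_original = torch.randn(vocab_size)   # full-context forward pass
    logits_amplified = torch.randn(vocab_size)  # pass keeping only least-important vision tokens
    next_token = contrastive_logits(logits_original, logits_amplified).argmax()
    print(int(next_token))
```

In practice the amplified branch reuses the same model, so the extra cost is limited to the pruned vision tokens rather than a second full-length forward pass, which is consistent with the abstract's claim of modest additional computation.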
Cite
Text
Huo et al. "Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models." International Conference on Learning Representations, 2025.
Markdown
[Huo et al. "Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/huo2025iclr-selfintrospective/)
BibTeX
@inproceedings{huo2025iclr-selfintrospective,
title = {{Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models}},
author = {Huo, Fushuo and Xu, Wenchao and Zhang, Zhong and Wang, Haozhao and Chen, Zhicheng and Zhao, Peilin},
booktitle = {International Conference on Learning Representations},
year = {2025},
url = {https://mlanthology.org/iclr/2025/huo2025iclr-selfintrospective/}
}