DAMO: Decoding by Accumulating Activations Momentum for Mitigating Hallucinations in Vision-Language Models
Abstract
Large Vision-Language Models (VLMs) exhibit significant potential in multimodal tasks but often struggle with hallucinations: responses that are plausible yet visually ungrounded. In this work, we investigate the layer-wise prediction tendencies of VLMs and conduct an in-depth analysis of their decoding mechanism. We observe that VLMs tend to "overthink" during the final stages of decoding, making significant prediction shifts in the last few layers that often favor incorrect results, which leads to a surge in hallucinated outputs. Leveraging this localized pattern, we propose a novel decoding strategy inspired by the momentum mechanism used in gradient-descent-based optimizers. Our method enforces decoding consistency across layers in an adaptive manner during forward passes, an approach that remains under-explored in existing work. This strategy significantly improves the reliability and performance of VLMs on various multimodal tasks, while introducing only negligible efficiency overhead.
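The abstract does not give the exact formulation, but the core idea, accumulating a "momentum" over per-layer activations so that the final layers cannot abruptly overturn the prediction tendency built up earlier, can be sketched as below. This is a minimal illustration only: the exponential-moving-average form, the fixed `momentum` coefficient, and the LLaMA-style attribute names (`model.model.norm`, `model.lm_head`) are assumptions for illustration, not the authors' implementation, which applies the accumulation adaptively.

```python
import torch

@torch.no_grad()
def momentum_decode_step(model, input_ids, momentum=0.9):
    """Sketch: accumulate an EMA over per-layer hidden states of the last
    token and predict the next token from the accumulated representation.
    Assumes a HuggingFace-style causal (V)LM that returns hidden states and
    exposes a final norm and LM head; all constants are illustrative."""
    outputs = model(input_ids, output_hidden_states=True)
    # hidden_states is a tuple: (embeddings, layer_1, ..., layer_L)
    hidden_states = outputs.hidden_states

    # Start from the first layer's last-token activation.
    accumulated = hidden_states[1][:, -1, :]
    for layer_state in hidden_states[2:]:
        current = layer_state[:, -1, :]
        # EMA ("momentum") across layers: later layers refine the running
        # representation but cannot abruptly overturn it.
        accumulated = momentum * accumulated + (1.0 - momentum) * current

    # Project the accumulated state through the final norm and LM head
    # (attribute paths follow LLaMA-style models; adjust for other backbones).
    logits = model.lm_head(model.model.norm(accumulated))
    next_token = logits.argmax(dim=-1)
    return next_token
```

A fixed coefficient is used here only to make the accumulation concrete; the paper's method adapts the consistency enforcement across layers rather than relying on a single hand-set value.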
Cite
Text
Wang et al. "DAMO: Decoding by Accumulating Activations Momentum for Mitigating Hallucinations in Vision-Language Models." International Conference on Learning Representations, 2025.
Markdown
[Wang et al. "DAMO: Decoding by Accumulating Activations Momentum for Mitigating Hallucinations in Vision-Language Models." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/wang2025iclr-damo/)
BibTeX
@inproceedings{wang2025iclr-damo,
title = {{DAMO: Decoding by Accumulating Activations Momentum for Mitigating Hallucinations in Vision-Language Models}},
author = {Wang, Kaishen and Gu, Hengrui and Gao, Meijun and Zhou, Kaixiong},
booktitle = {International Conference on Learning Representations},
year = {2025},
url = {https://mlanthology.org/iclr/2025/wang2025iclr-damo/}
}