Hallucination Reduction in Video-Language Models via Hierarchical Multimodal Consistency
Abstract
The rapid advancement of large language models (LLMs) has led to the widespread adoption of video-language models (VLMs) across various domains. However, VLMs are often hindered by weak semantic discrimination, exacerbated by the limited diversity and biased sample distribution of most video-language datasets. This limitation results in a biased understanding of the semantic relationships between visual concepts, leading to hallucinations. To address this challenge, we propose a Multi-level Multimodal Alignment (MMA) framework that leverages a text encoder and a semantic discriminative loss to achieve multi-level alignment. This enables the model to capture both low-level and high-level semantic relationships, thereby reducing hallucinations. By incorporating language-level alignment into the training process, our approach ensures stronger semantic consistency between the video and textual modalities. Furthermore, we introduce a two-stage progressive training strategy that exploits larger and more diverse datasets to enhance semantic alignment and better capture general semantic relationships between the visual and textual modalities. Our comprehensive experiments demonstrate that the proposed MMA method significantly mitigates hallucinations and achieves state-of-the-art performance across multiple video-language tasks, establishing a new benchmark in the field.
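The abstract's "semantic discriminative loss" applied at multiple alignment levels is not specified in detail here; a common instantiation of this idea is a symmetric contrastive (InfoNCE-style) loss between paired video and text embeddings, averaged over granularity levels (e.g. frame-word and clip-sentence). The sketch below is an illustrative assumption, not the paper's actual implementation; the function names `info_nce` and `multi_level_loss` and the level weighting are hypothetical.

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    video_emb, text_emb: (B, D) arrays; row i of each is a matched pair.
    Matched pairs are pulled together, mismatched pairs pushed apart.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature            # (B, B) cosine-similarity matrix
    labels = np.arange(len(v))                # the diagonal holds the positives

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()

    # cross-entropy in both directions: video->text and text->video
    return 0.5 * (xent(logits) + xent(logits.T))

def multi_level_loss(levels, weights=None):
    """Weighted sum of the alignment loss over several granularity
    levels, e.g. [(frame_embs, word_embs), (clip_embs, sent_embs)]."""
    weights = weights or [1.0 / len(levels)] * len(levels)
    return sum(w * info_nce(v, t) for w, (v, t) in zip(weights, levels))
```

In this formulation, perfectly aligned pairs drive the loss toward zero, while mismatched pairs keep it high, which is the discriminative pressure the abstract attributes to its alignment objective.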
Cite
Text
Dang et al. "Hallucination Reduction in Video-Language Models via Hierarchical Multimodal Consistency." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/1019
Markdown
[Dang et al. "Hallucination Reduction in Video-Language Models via Hierarchical Multimodal Consistency." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/dang2025ijcai-hallucination/) doi:10.24963/IJCAI.2025/1019
BibTeX
@inproceedings{dang2025ijcai-hallucination,
title = {{Hallucination Reduction in Video-Language Models via Hierarchical Multimodal Consistency}},
author = {Dang, Jisheng and Deng, Shengjun and Chang, Haochen and Wang, Teng and Wang, Bimei and Wang, Shude and Zhu, Nannan and Niu, Guo and Zhao, Jingwen and Liu, Jizhao},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2025},
pages = {9167-9175},
doi = {10.24963/IJCAI.2025/1019},
url = {https://mlanthology.org/ijcai/2025/dang2025ijcai-hallucination/}
}