Hallucination Augmented Contrastive Learning for Multimodal Large Language Model

Abstract

Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks. However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information. In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning. We first analyze the representation distribution of textual and visual tokens in MLLMs, revealing two important findings: 1) there is a significant gap between textual and visual representations, indicating unsatisfactory cross-modal representation alignment; 2) representations of texts that contain and do not contain hallucinations are entangled, making it challenging to distinguish them. These two observations inspire a simple yet effective method to mitigate hallucinations. Specifically, we introduce contrastive learning into MLLMs and use text with hallucinations as hard negative examples, naturally bringing representations of non-hallucinated text and visual samples closer while pushing away representations of non-hallucinated and hallucinated text. We evaluate our method quantitatively and qualitatively, showing its effectiveness in reducing hallucination occurrences and improving performance across multiple benchmarks. On the MMhal-Bench benchmark, our method obtains a 34.66% / 29.5% improvement over the baseline MiniGPT-4/LLaVA, respectively. Our code is available at https://github.com/X-PLUG/mPLUG-HalOwl/tree/main/hacl.
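To make the core idea concrete, below is a minimal sketch (not the authors' released code) of how hallucinated captions can be folded into an image-text contrastive loss as hard negatives. The function name `hacl_contrastive_loss` and the assumption of precomputed, index-aligned embeddings are illustrative only.

```python
# Minimal sketch: InfoNCE-style image-text contrastive loss where the
# hallucinated caption of each image is appended as an extra hard negative.
# Assumes precomputed embeddings: img_emb / txt_emb (aligned by index) and
# hal_emb (a hallucinated caption per image). Illustrative, not the paper's code.
import torch
import torch.nn.functional as F

def hacl_contrastive_loss(img_emb, txt_emb, hal_emb, temperature=0.07):
    """Each image should match its own caption; other in-batch captions and
    all hallucinated captions are treated as negatives."""
    img = F.normalize(img_emb, dim=-1)  # (B, D)
    txt = F.normalize(txt_emb, dim=-1)  # (B, D)
    hal = F.normalize(hal_emb, dim=-1)  # (B, D) hard negatives

    logits_txt = img @ txt.t() / temperature          # (B, B) image-to-caption
    logits_hal = img @ hal.t() / temperature          # (B, B) image-to-hallucination
    logits = torch.cat([logits_txt, logits_hal], dim=1)  # (B, 2B)

    # Positive for image i is ground-truth caption i (column i of logits_txt).
    targets = torch.arange(img.size(0), device=img.device)
    return F.cross_entropy(logits, targets)

# Usage with random embeddings, just to show the expected shapes.
B, D = 8, 256
loss = hacl_contrastive_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
```

Appending the hallucinated captions to the negative set is what pulls visual and non-hallucinated text representations together while pushing hallucinated text away, matching the intuition described in the abstract.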

Cite

Text

Jiang et al. "Hallucination Augmented Contrastive Learning for Multimodal Large Language Model." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02553

Markdown

[Jiang et al. "Hallucination Augmented Contrastive Learning for Multimodal Large Language Model." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/jiang2024cvpr-hallucination/) doi:10.1109/CVPR52733.2024.02553

BibTeX

@inproceedings{jiang2024cvpr-hallucination,
  title     = {{Hallucination Augmented Contrastive Learning for Multimodal Large Language Model}},
  author    = {Jiang, Chaoya and Xu, Haiyang and Dong, Mengfan and Chen, Jiaxing and Ye, Wei and Yan, Ming and Ye, Qinghao and Zhang, Ji and Huang, Fei and Zhang, Shikun},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {27036--27046},
  doi       = {10.1109/CVPR52733.2024.02553},
  url       = {https://mlanthology.org/cvpr/2024/jiang2024cvpr-hallucination/}
}