Unleashing Region Understanding in Intermediate Layers for MLLM-Based Referring Expression Generation
Abstract
Referring Expression Generation (REG) based on Multi-modal Large Language Models (MLLMs) has gained increasing popularity. The task aims to generate an unambiguous text description that applies to exactly one object or region in an image by leveraging foundation models. We empirically find a potential trade-off between the level of detail and the correctness of descriptions of the referred objects. On the one hand, more detailed sentences are usually required to describe an object precisely. On the other hand, more complicated sentences increase the probability of hallucinations. To address this issue, we propose a training-free framework, named "unleash-then-eliminate", which first elicits the latent information in the intermediate layers and then adopts a cycle-consistency-based decoding method to reduce hallucinations. Furthermore, to reduce the computational load of cycle-consistency-based decoding, we devise a Probing-based Importance Estimation method that statistically estimates the importance weights of intermediate layers within a subset. These importance weights are then incorporated into the decoding process over the entire dataset, intervening in the next-token prediction from intermediate layers. Extensive experiments on the RefCOCOg and PHD benchmarks show that the proposed framework outperforms existing methods on both semantic and hallucination-related metrics. Code will be made available at https://github.com/Glupayy/unleash-eliminate.
Cite
Text
Liang et al. "Unleashing Region Understanding in Intermediate Layers for MLLM-Based Referring Expression Generation." Neural Information Processing Systems, 2024. doi:10.52202/079017-3833
Markdown
[Liang et al. "Unleashing Region Understanding in Intermediate Layers for MLLM-Based Referring Expression Generation." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/liang2024neurips-unleashing/) doi:10.52202/079017-3833
BibTeX
@inproceedings{liang2024neurips-unleashing,
title = {{Unleashing Region Understanding in Intermediate Layers for MLLM-Based Referring Expression Generation}},
author = {Liang, Yaoyuan and Cai, Zhuojun and Xu, Jian and Huang, Guanbo and Wang, Yiran and Liang, Xiao and Liu, Jiahao and Li, Ziran and Wang, Jingang and Huang, Shao-Lun},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-3833},
url = {https://mlanthology.org/neurips/2024/liang2024neurips-unleashing/}
}