ChatterBox: Multimodal Referring and Grounding with Chain-of-Questions

Abstract

In this study, we establish a benchmark and a baseline approach for Multimodal referring and grounding with Chain-of-Questions (MCQ), opening up a promising direction for ‘logical’ multimodal dialogues. The newly collected dataset, named CB-300K, spans challenges including probing dialogues with spatial relationships among multiple objects, consistent reasoning, and complex question chains. The baseline approach, termed ChatterBox, uses a modularized design and a referent feedback mechanism to ensure logical coherence in continuous referring and grounding tasks. This design reduces the risk of referential confusion, simplifies the training process, and proves effective at retaining the language model’s generation ability. Experiments show that ChatterBox demonstrates superiority in MCQ both quantitatively and qualitatively, paving a new path towards multimodal dialogue scenarios with logical interactions.

Cite

Text

Tian et al. "ChatterBox: Multimodal Referring and Grounding with Chain-of-Questions." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I7.32796

Markdown

[Tian et al. "ChatterBox: Multimodal Referring and Grounding with Chain-of-Questions." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/tian2025aaai-chatterbox/) doi:10.1609/AAAI.V39I7.32796

BibTeX

@inproceedings{tian2025aaai-chatterbox,
  title     = {{ChatterBox: Multimodal Referring and Grounding with Chain-of-Questions}},
  author    = {Tian, Yunjie and Ma, Tianren and Xie, Lingxi and Ye, Qixiang},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {7401--7409},
  doi       = {10.1609/AAAI.V39I7.32796},
  url       = {https://mlanthology.org/aaai/2025/tian2025aaai-chatterbox/}
}