HAMMR: HierArchical MultiModal React Agents for Generic VQA

Abstract

The next generation of Visual Question Answering (VQA) systems should handle a broad range of questions across many VQA benchmarks. We therefore aim to develop a single system for a varied suite of VQA tasks including counting, spatial reasoning, OCR-based reasoning, visual pointing, external knowledge, and more. In this setting, we demonstrate that naively applying an LLM+tools approach using the combined set of all tools leads to poor results. This motivates us to introduce HAMMR: HierArchical MultiModal React. We start from a multimodal ReAct-based system and make it hierarchical by enabling our HAMMR agents to call upon other specialized agents. This enhances compositionality, which we show is critical for obtaining high accuracy. On our generic VQA suite, HAMMR outperforms a naive LLM+tools approach by 16.3% and the generic standalone PaLI-X VQA model by 5.0%.
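To make the hierarchical idea concrete: a ReAct agent alternates between reasoning steps and tool calls, and HAMMR's key move is that a specialized agent (with its own small toolset and prompt) can itself be exposed to a top-level agent as just another tool. The Python below is a minimal sketch of that pattern only; the LLM stubs and tool names (counting_llm, detect_objects, counting_agent) are hypothetical stand-ins so the loop runs end to end, and this is not the authors' implementation.

from typing import Callable, Dict

LLMFn = Callable[[str], str]
ToolFn = Callable[[str], str]

def react_agent(tools: Dict[str, ToolFn], llm: LLMFn,
                question: str, max_steps: int = 5) -> str:
    """One ReAct loop: the LLM emits Action lines, tool outputs are fed
    back as Observations, and the loop stops at a final Answer line."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        if step.startswith("Answer:"):
            return step.removeprefix("Answer:").strip()
        if step.startswith("Action:"):
            name, _, arg = step.removeprefix("Action:").strip().partition("[")
            observation = tools[name.strip()](arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"
    return "no answer found"

# The hierarchical step: wrap a specialized agent so a higher-level agent
# can call it exactly like an ordinary tool.
def as_tool(tools: Dict[str, ToolFn], llm: LLMFn) -> ToolFn:
    return lambda question: react_agent(tools, llm, question)

# Deterministic toy "LLMs" so the sketch is self-contained (hypothetical).
def counting_llm(transcript: str) -> str:
    if "Observation:" in transcript:
        return "Answer: 3"
    return "Action: detect_objects[cats]"

def top_level_llm(transcript: str) -> str:
    if "Observation:" in transcript:
        return "Answer: " + transcript.rsplit("Observation:", 1)[1].strip()
    return "Action: counting_agent[How many cats are in the image?]"

counting_agent = as_tool({"detect_objects": lambda q: "3 boxes found"},
                         counting_llm)
print(react_agent({"counting_agent": counting_agent}, top_level_llm,
                  "How many cats are in the image?"))  # prints: 3

In this toy run, the top-level agent delegates the whole counting question to the specialist, which uses its own detector tool and returns an answer; keeping each agent's toolset small is what the abstract argues restores compositionality compared to handing one agent the combined set of all tools.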

Cite

Text

Castrejon et al. "HAMMR: HierArchical MultiModal React Agents for Generic VQA." NeurIPS 2024 Workshops: Compositional_Learning, 2024.

Markdown

[Castrejon et al. "HAMMR: HierArchical MultiModal React Agents for Generic VQA." NeurIPS 2024 Workshops: Compositional_Learning, 2024.](https://mlanthology.org/neuripsw/2024/castrejon2024neuripsw-hammr/)

BibTeX

@inproceedings{castrejon2024neuripsw-hammr,
  title     = {{HAMMR: HierArchical MultiModal React Agents for Generic VQA}},
  author    = {Castrejon, Lluis and Mensink, Thomas and Zhou, Howard and Ferrari, Vittorio and Araujo, Andre and Uijlings, Jasper},
  booktitle = {NeurIPS 2024 Workshops: Compositional_Learning},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/castrejon2024neuripsw-hammr/}
}