A Multimodal Chain of Tools for Described Object Detection
Abstract
Described object detection (DOD) is a promising direction for fine-grained, human-interactive visual recognition, where the goal is to detect target objects based on a given language description. Despite significant advances in language-based object detection, current models still struggle with complex descriptions due to limited compositional understanding. To address this issue, we propose a multimodal chain-of-tools (MCoTs) framework that integrates specialized tools to handle the two core functionalities of the DOD task: localization and compositional reasoning. Specifically, we decompose the complex DOD task into a series of subtasks, with each subtask handled by a specialized tool, such as an object detector or a multimodal large language model (MLLM). This simple yet effective framework yields significant performance improvements on the challenging D3 benchmark without any additional training overhead.
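The abstract only names the decomposition; as a rough illustration, the Python sketch below chains a localization tool and a reasoning tool in the manner described: a detector proposes candidate boxes, and an MLLM filters them against the full description. Every name here (Box, detect_described_objects, the dummy tools) and the phrase-extraction heuristic are hypothetical assumptions, not the authors' implementation, prompts, or actual decomposition.

# A minimal sketch of a chain-of-tools DOD pipeline, reconstructed from the
# abstract alone. All names and heuristics below are hypothetical.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float
    score: float

# Tool 1 (localization): maps an image and a short phrase to candidate boxes.
ProposeFn = Callable[[object, str], List[Box]]
# Tool 2 (compositional reasoning): judges whether a region satisfies the
# full, possibly compositional, description.
VerifyFn = Callable[[object, Box, str], bool]

def detect_described_objects(image: object, description: str,
                             propose_boxes: ProposeFn,
                             verify_description: VerifyFn) -> List[Box]:
    """Decompose DOD into two subtasks chained across specialized tools."""
    # Subtask 1: localize candidates for the head phrase of the description.
    # A real system would extract the noun phrase properly; splitting at the
    # first comma is only a placeholder.
    head_phrase = description.split(",")[0]
    candidates = propose_boxes(image, head_phrase)

    # Subtask 2: keep only candidates the reasoning tool accepts under the
    # complete description (attributes, relations, negations, ...).
    return [b for b in candidates if verify_description(image, b, description)]

if __name__ == "__main__":
    # Dummy tools so the sketch runs end to end without real models.
    def dummy_detector(image, phrase):
        return [Box(0, 0, 10, 10, 0.9), Box(5, 5, 20, 20, 0.4)]

    def dummy_mllm(image, box, description):
        return box.score > 0.5  # stand-in for an MLLM yes/no judgment

    boxes = detect_described_objects(None, "a dog, not wearing a collar",
                                     dummy_detector, dummy_mllm)
    print(boxes)  # -> [Box(x1=0, y1=0, x2=10, y2=10, score=0.9)]

Because the two tools are frozen off-the-shelf models, a pipeline of this shape needs no additional training, which matches the abstract's claim of no extra training overhead.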
Cite
Text
Park et al. "A Multimodal Chain of Tools for Described Object Detection." NeurIPS 2024 Workshops: Compositional_Learning, 2024.
Markdown
[Park et al. "A Multimodal Chain of Tools for Described Object Detection." NeurIPS 2024 Workshops: Compositional_Learning, 2024.](https://mlanthology.org/neuripsw/2024/park2024neuripsw-multimodal/)
BibTeX
@inproceedings{park2024neuripsw-multimodal,
  title = {{A Multimodal Chain of Tools for Described Object Detection}},
  author = {Park, Kwanyong and Lee, Youngwan and Lee, Yong-Ju},
  booktitle = {NeurIPS 2024 Workshops: Compositional_Learning},
  year = {2024},
  url = {https://mlanthology.org/neuripsw/2024/park2024neuripsw-multimodal/}
}