LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Liu, Shilong; Cheng, Hao; Liu, Haotian; Zhang, Hao; Li, Feng; Ren, Tianhe; Zou, Xueyan; Yang, Jianwei; Su, Hang; Zhu, Jun; Zhang, Lei; Gao, Jianfeng; Li, Chunyuan

doi:10.1007/978-3-031-72970-6_8

ECCV 2024

Abstract

This paper presents LLaVA-Plus, a general-purpose multimodal assistant trained using an end-to-end approach that systematically expands the capabilities of large multimodal models (LMMs). LLaVA-Plus maintains a skill repository that contains a wide range of vision and vision-language pre-trained models (tools), and is able to activate relevant tools, given users’ multimodal inputs, to compose their execution results on the fly to fulfill many real-world tasks. To acquire the ability of using tools, LLaVA-Plus is trained on multimodal instruction-following data that we have curated. The training data covers many tool use examples of visual understanding, generation, external knowledge retrieval and their compositions. Empirical results show that LLaVA-Plus outperforms LLaVA in existing capabilities, and exhibits many new capabilities. Compared with tool-augmented LLMs, LLaVA-Plus is distinct in that the image query is directly grounded in and actively engaged throughout the entire human-AI interaction sessions, significantly improving tool use performance and enabling new scenarios.

Cite

Text

Liu et al. "LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72970-6_8

Markdown

[Liu et al. "LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/liu2024eccv-llavaplus/) doi:10.1007/978-3-031-72970-6_8

BibTeX

@inproceedings{liu2024eccv-llavaplus,
  title     = {{LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents}},
  author    = {Liu, Shilong and Cheng, Hao and Liu, Haotian and Zhang, Hao and Li, Feng and Ren, Tianhe and Zou, Xueyan and Yang, Jianwei and Su, Hang and Zhu, Jun and Zhang, Lei and Gao, Jianfeng and Li, Chunyuan},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72970-6_8},
  url       = {https://mlanthology.org/eccv/2024/liu2024eccv-llavaplus/}
}