TaskGalaxy: Scaling Multi-Modal Instruction Fine-Tuning with Tens of Thousands Vision Task Types
Abstract
Multimodal vision-language models are gaining prominence in open-world applications, driven by advances in model architectures, training techniques, and high-quality data. However, their performance is often limited by insufficient task-specific data, leading to poor generalization and biased outputs. Existing efforts to increase task diversity in fine-tuning datasets are hindered by the labor-intensive process of manual task labeling, which typically produces only a few hundred task types. To address this, we propose TaskGalaxy, a large-scale multimodal instruction fine-tuning dataset comprising 19,227 hierarchical task types and 413,648 samples. TaskGalaxy uses GPT-4o to expand a small set of manually defined tasks into a rich hierarchy, employs CLIP and GPT-4o to select the task types that best match open-source images, and then generates relevant question-answer pairs; multiple models screen the resulting samples for quality. This automated pipeline improves both task diversity and data quality while reducing manual intervention. Fine-tuning LLaVA-v1.5 and InternVL-Chat-v1.0 on TaskGalaxy yields substantial improvements across 16 benchmarks, demonstrating the critical importance of task diversity. TaskGalaxy is publicly released at https://github.com/Kwai-YuanQi/TaskGalaxy.
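To make the task/image matching step concrete, the sketch below scores candidate task-type descriptions against an image with CLIP and keeps the best matches, as the abstract describes before the GPT-4o re-check and question-answer generation. It is a minimal sketch assuming the Hugging Face `transformers` CLIP API; the checkpoint name, `top_k` parameter, image path, and example task strings are illustrative assumptions, not the authors' released pipeline.

```python
# Hedged sketch of CLIP-based task-type filtering (not the official TaskGalaxy code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # assumed checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def match_tasks(image: Image.Image, task_types: list[str], top_k: int = 5):
    """Rank candidate hierarchical task types by CLIP image-text similarity."""
    inputs = processor(text=task_types, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_image has shape (1, num_tasks): one score per candidate task.
        scores = model(**inputs).logits_per_image.squeeze(0)
    best = scores.topk(min(top_k, len(task_types)))
    return [(task_types[int(i)], scores[i].item()) for i in best.indices]

# Usage: tasks surviving this filter would then be re-checked by GPT-4o
# before question-answer pairs are generated, per the pipeline above.
tasks = ["chart understanding: trend analysis",  # illustrative task strings
         "OCR: scene text recognition"]
print(match_tasks(Image.open("example.jpg"), tasks, top_k=1))
```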
Cite
Text
Chen et al. "TaskGalaxy: Scaling Multi-Modal Instruction Fine-Tuning with Tens of Thousands Vision Task Types." International Conference on Learning Representations, 2025.
Markdown
[Chen et al. "TaskGalaxy: Scaling Multi-Modal Instruction Fine-Tuning with Tens of Thousands Vision Task Types." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/chen2025iclr-taskgalaxy/)
BibTeX
@inproceedings{chen2025iclr-taskgalaxy,
title = {{TaskGalaxy: Scaling Multi-Modal Instruction Fine-Tuning with Tens of Thousands Vision Task Types}},
author = {Chen, Jiankang and Zhang, Tianke and Liu, Changyi and Ding, Haojie and Shi, Yaya and Cheng, Feng and Xiao, Huihui and Wen, Bin and Yang, Fan and Gao, Tingting and Zhang, Di},
booktitle = {International Conference on Learning Representations},
year = {2025},
url = {https://mlanthology.org/iclr/2025/chen2025iclr-taskgalaxy/}
}