Oasis: One Image Is All You Need for Multimodal Instruction Data Synthesis

Abstract

The success of multimodal large language models (MLLMs) has been largely attributed to large-scale training data. However, the training data of many MLLMs is unavailable due to privacy concerns, and the expensive, labor-intensive process of collecting multimodal data further exacerbates the problem. Is it possible to synthesize multimodal training data automatically without compromising diversity and quality? In this paper, we propose a new method, Oasis, that synthesizes high-quality multimodal data from images alone. Unlike traditional methods, Oasis prompts MLLMs with only images, which extends data diversity by a large margin. Our method also features a careful quality-control procedure that ensures data quality. We collected over 500k synthesized samples and conducted incremental experiments on LLaVA-NeXT. Extensive experiments demonstrate that our method significantly improves the performance of MLLMs. Because synthesis is driven by images alone, Oasis can also target the domain-specific abilities of MLLMs. Code and data will be publicly available.
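For intuition, the core loop described in the abstract can be sketched as: prompt an off-the-shelf MLLM with nothing but an image, let it generate freely, and keep the output only if it passes a quality check. Below is a minimal, hypothetical Python sketch using the Hugging Face transformers LLaVA interface; the model checkpoint, the image-only prompt format, and the toy quality heuristic are illustrative assumptions, not the released Oasis pipeline.

from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint; any capable MLLM would do
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(MODEL_ID)

def synthesize(image_path):
    """Prompt the MLLM with only an image (no textual task) and keep the
    continuation only if it looks like a usable instruction-response pair."""
    image = Image.open(image_path).convert("RGB")
    # The prompt carries only the image placeholder -- no instruction text.
    inputs = processor(images=image, text="<image>", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True)
    text = processor.decode(output_ids[0], skip_special_tokens=True).strip()

    # Toy quality gate; the paper's quality control is more elaborate.
    if len(text.split()) < 10 or "?" not in text:
        return None
    instruction, _, response = text.partition("?")
    return {"image": image_path,
            "instruction": instruction.strip() + "?",
            "response": response.strip()}

sample = synthesize("example.jpg")
if sample is not None:
    print(sample["instruction"], "->", sample["response"])

In the actual method, splitting the generated continuation into an instruction-response pair and filtering for quality are considerably more involved than the heuristic shown here.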

Cite

Text

Zhang et al. "Oasis: One Image Is All You Need for Multimodal Instruction Data Synthesis." International Conference on Computer Vision, 2025.

Markdown

[Zhang et al. "Oasis: One Image Is All You Need for Multimodal Instruction Data Synthesis." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/zhang2025iccv-oasis/)

BibTeX

@inproceedings{zhang2025iccv-oasis,
  title     = {{Oasis: One Image Is All You Need for Multimodal Instruction Data Synthesis}},
  author    = {Zhang, Letian and Cui, Quan and Zhao, Bingchen and Yang, Cheng},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {3542--3551},
  url       = {https://mlanthology.org/iccv/2025/zhang2025iccv-oasis/}
}