OURO: A Self-Bootstrapped Framework for Enhancing Multimodal Scene Understanding

ICCV 2025, pp. 18240-18251

Abstract

Multimodal large models have made significant progress, yet fine-grained understanding of complex scenes remains a challenge. High-quality, large-scale vision-language datasets are essential for addressing this issue. However, existing methods often rely on labor-intensive manual annotation or on high-performing closed-source models, making large-scale data collection costly. To overcome these limitations, we propose a self-bootstrapped training pipeline that leverages the model's own multimodal capabilities to recursively refine its understanding. By decomposing existing multimodal data into localized sub-regions and generating hierarchical scene descriptions and multi-faceted question-answer pairs, we construct a dataset of 1.4M image-task instances. We then use this dataset to train the base model, significantly enhancing its ability to interpret complex visual scenes and perform a variety of vision-related tasks. Our OURO model, fine-tuned from Qwen2-VL-7B-Instruct with LoRA, achieves substantial improvements over both the base model and similarly sized counterparts across multiple multimodal benchmarks. Our self-bootstrapped training pipeline offers a novel paradigm for the continuous improvement of multimodal models. Code and datasets are available at https://github.com/tinnel123666888/OURO.git.
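The abstract describes a two-stage recipe: the model decomposes images into localized sub-regions, generates hierarchical descriptions and question-answer pairs from them, and is then fine-tuned on the resulting data. Below is a minimal sketch of that data-generation loop; all names (`query_model`, `Region`, `SceneRecord`) and prompts are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch of a self-bootstrapped data-generation loop in the spirit of OURO.
# `query_model` stands in for a call to the base multimodal model
# (e.g. Qwen2-VL-7B-Instruct); the helper names and prompts are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Region:
    bbox: tuple              # (x1, y1, x2, y2) sub-region of the source image
    description: str = ""


@dataclass
class SceneRecord:
    image_path: str
    global_caption: str = ""
    regions: list = field(default_factory=list)
    qa_pairs: list = field(default_factory=list)


def query_model(image_path: str, prompt: str, bbox=None) -> str:
    """Placeholder for the base model's inference call; cropping to
    `bbox` restricts the model's view to one localized sub-region."""
    raise NotImplementedError


def bootstrap_record(image_path: str, boxes: list) -> SceneRecord:
    rec = SceneRecord(image_path)
    # 1. Hierarchical description: caption each sub-region, then ask the
    #    model to merge the region captions into one global scene summary.
    for box in boxes:
        desc = query_model(image_path, "Describe this region in detail.", bbox=box)
        rec.regions.append(Region(box, desc))
    region_text = "\n".join(r.description for r in rec.regions)
    rec.global_caption = query_model(
        image_path,
        f"Combine these region descriptions into one scene summary:\n{region_text}",
    )
    # 2. Multi-faceted QA: generate question-answer pairs grounded in the
    #    scene (objects, attributes, counts, spatial relations, ...).
    qa_raw = query_model(
        image_path,
        "Write diverse question-answer pairs about this scene, "
        "covering objects, attributes, and spatial relations.",
    )
    rec.qa_pairs = [line for line in qa_raw.splitlines() if line.strip()]
    return rec
```

Under this reading, each `SceneRecord` would be serialized into an instruction-tuning format, and the accumulated corpus used for LoRA fine-tuning of the same base model, closing the bootstrap loop.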

Cite

Text

Xu et al. "OURO: A Self-Bootstrapped Framework for Enhancing Multimodal Scene Understanding." International Conference on Computer Vision, 2025.

Markdown

[Xu et al. "OURO: A Self-Bootstrapped Framework for Enhancing Multimodal Scene Understanding." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/xu2025iccv-ouro/)

BibTeX

@inproceedings{xu2025iccv-ouro,
  title     = {{OURO: A Self-Bootstrapped Framework for Enhancing Multimodal Scene Understanding}},
  author    = {Xu, Tianrun and Chen, Guanyu and Li, Ye and Xi, Yuxin and Mu, Zeyu and Wang, Ruichen and Zhang, Tianren and Gao, Haichuan and Chen, Feng},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {18240--18251},
  url       = {https://mlanthology.org/iccv/2025/xu2025iccv-ouro/}
}